Fake news is a broad term. In this project we focus on COVID-19 tweets and online articles, and on whether the information they present to the reader is fake or real. In this notebook we process the COVID-19 tweets dataset and the online articles dataset, using what we learned in the exploratory data analysis (EDA) notebooks to fix the issues found there. We then visualize the cleaned data and draw conclusions. After performing this procedure on both datasets, we merge them and briefly summarize our findings.
News comes in many different forms: newspapers, magazines, TV and radio, the internet, news agencies, and alternative media. Some businesses try to promote or advertise their products, and some publish fake but attractive news as 'clickbait'. Because we tend to be curious about unusual things, we read or check the content. For instance, in 1835 a newspaper company claimed to have discovered creatures living on the Moon (humans with wings). As a consequence the company became more popular, which allowed it to promote and advertise more. This is one reason why a person or a group of people would publish fake news or misleading information. However, the influence on the reader is usually harmful. For example, fake news about healthy food might recommend certain types of food to people who suffer from some illness, and a reader who follows those suggestions could in fact be harmed, since the news was fake in the first place. There are more scenarios, but this is where the problem lies: fake news is mostly published for profit, and readers might believe misleading information that influences their decisions.
In this research, though, the focus is on COVID-19 news in the form of tweets and on fake and legitimate articles about various topics published on the internet.
After reading this notebook you should have:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import spacy
nlp = spacy.load('en_core_web_sm')
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
from wordcloud import WordCloud
from textwrap import wrap
from sklearn.feature_extraction.text import CountVectorizer
import string
covid_tweets = pd.read_csv('covid19_tweets.csv')
online_articles = pd.read_csv('fake_or_real_news.csv')
Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.
Seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
NumPy: NumPy is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Pandas: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
re: A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression.
nltk: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
spacy: spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.
langdetect: A Python library that allows you to identify the language of a given string.
wordcloud: Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance.
CountVectorizer: CountVectorizer is a tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
string: The Python string module contains constants, utility functions, and classes for string manipulation.
In this section of the notebook, I will take the COVID-19 tweets dataset through the following procedure:
At the end of this section, the tweets dataset should be ready to be merged with the online article dataset.
In the COVID-19 tweets EDA notebook we found that id is a unique identifier for each tweet, which does not help us predict whether or not a tweet is reliable. Therefore I will drop that column from the dataset.
covid_tweets.drop(['id'],axis=1,inplace=True)
covid_tweets.head()
| | tweet | label |
|---|---|---|
| 0 | The CDC currently reports 99031 deaths. In gen... | real |
| 1 | States reported 1121 deaths a small rise from ... | real |
| 2 | Politically Correct Woman (Almost) Uses Pandem... | fake |
| 3 | #IndiaFightsCorona: We have 1524 #COVID testin... | real |
| 4 | Populous states can generate large case counts... | real |
The dataset no longer has the unique 'id' identifier and now contains only the columns we will use to predict reliability.
covid_tweets.columns.tolist()
['tweet', 'label']
The COVID-19 tweets dataset contains the columns tweet and label, neither of which has white space in its name. This will make it easier to merge with the online articles dataset later.
covid_tweets.isnull().sum()
tweet    0
label    0
dtype: int64
The covid-19 tweets dataset does not have any missing values. That is great because now we do not have to worry about handling missing data.
covid_tweets[covid_tweets.duplicated(keep=False)]
| | tweet | label |
|---|---|---|
The COVID-19 tweets dataset does not have any duplicated rows.
In this sub-section I will use the notes we derived from the COVID-19 tweets exploratory data analysis notebook to fix text issues.
The conclusion of the COVID-19 tweets EDA notebook mentioned that the following actions would solve the problems we noticed:
The following note is taken from the EDA notebook as well:
After that we should have clean text. To continue processing it, we need to remove stop words, since they appear often in the corpus without carrying much meaning, and apply lemmatization to bring words from their various tense forms back to their base form (for example, 'went' becomes 'go').
NLP models treat words like 'Goat' and 'goat' as different tokens even though they are the same word. To overcome this problem, we lowercase the text. Here I use Python's built-in lower() string method to convert the text to lowercase.
covid_tweets['cleaned']=covid_tweets['tweet'].apply(lambda x: x.lower())
for index,text in enumerate(covid_tweets['cleaned'][35:50]):
    print('Review %d:\n'%(index+1),text)
Review 1: florida governor ron desantis botches covid-19 response - by banning corona beer in order to flatten pandemic curve. Review 2: we apologise to the government of ekiti state for this error. we remain committed to improving our quality control processes to ensure accurate and transparent reporting of cases https://t.co/c6ypex9khe Review 3: nyt invented the video of a doctor fighting coronavirus in hospital. Review 4: in may we did not break 30k cases in a day. today the south alone reported 32830. https://t.co/fgcegi3o7v Review 5: alert: americans with coronavirus symptoms are being asked to cough directly onto president trump Review 6: we launched the #covid19 solidarity response fund which has so far mobilized $225+m from more than 563000 individuals companies & philanthropies. in addition we mobilized $1+ billion from member states & other generous to support countries-@drtedros https://t.co/xgpkpdvn0r Review 7: football player cristiano ronaldo turned all his hotels into hospitals to help coronavirus patients and is paying doctors and the staff. Review 8: including that there will again be testing of asymptomatic workers involved in managed isolation and quarantine and airport and border staff. this is part of our wider surveillance measures and is expected to be operational in early july. Review 9: i don't want to do a national lockdown again.' if #coronavirus continues to 'progress' in the uk the pm says the govt will have to take "further measures" but insists that he doesn't want a "second national lockdown". read more here: https://t.co/oc6er6h6lg https://t.co/nlcysrwbbn Review 10: our daily update is published. states reported 586k tests 28k cases and 224 deaths. https://t.co/bqjpy9w8qf Review 11: #indiafightscorona new recoveries in india have exceeded the new cases for 5 consecutive days. 
#covid19 https://t.co/iwv0eym3hd Review 12: the top 5 states with high active caseload are also the ones which are presently reporting a high level of recoveries. https://t.co/howhfx5wpe Review 13: there are currently 4927 people in managed isolation and quarantine. our current effective capacity is 7126. this gives us an excess capacity of 2199. over the next week we are projecting 3590 arrivals and 2699 departures from our facilities. Review 14: schools are struggling to cope with a lack of #covid19 tests - with new infections increasing since it became compulsory for pupils to return. but when should you get your child tested for the virus? here's our explainer 👇 Review 15: scientists at astrazeneca complain their work on a coronavirus vaccine keeps being delayed by noddy holder ringing up to ask if it will be ready by christmas https://t.co/2lmyztnapx
All of the text has been converted to lowercase successfully.
Contractions are shortened forms of words, such as don't for do not and how'll for how will. They reduce the time needed to speak or write. We need to expand these contractions for a better analysis of the tweets.
The following dictionary was taken from this link.
# Dictionary of English Contractions
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not",
"can't": "cannot","can't've": "cannot have",
"'cause": "because","could've": "could have","couldn't": "could not",
"couldn't've": "could not have", "didn't": "did not","doesn't": "does not",
"don't": "do not","hadn't": "had not","hadn't've": "had not have",
"hasn't": "has not","haven't": "have not","he'd": "he would",
"he'd've": "he would have","he'll": "he will", "he'll've": "he will have",
"how'd": "how did","how'd'y": "how do you","how'll": "how will",
"I'd": "I would", "I'd've": "I would have","I'll": "I will",
"I'll've": "I will have","I'm": "I am","I've": "I have", "isn't": "is not",
"it'd": "it would","it'd've": "it would have","it'll": "it will",
"it'll've": "it will have", "let's": "let us","ma'am": "madam",
"mayn't": "may not","might've": "might have","mightn't": "might not",
"mightn't've": "might not have","must've": "must have","mustn't": "must not",
"mustn't've": "must not have", "needn't": "need not",
"needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
"oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have",
"she'll": "she will", "she'll've": "she will have","should've": "should have",
"shouldn't": "should not", "shouldn't've": "should not have","so've": "so have",
"that'd": "that would","that'd've": "that would have", "there'd": "there would",
"there'd've": "there would have", "they'd": "they would",
"they'd've": "they would have","they'll": "they will",
"they'll've": "they will have", "they're": "they are","they've": "they have",
"to've": "to have","wasn't": "was not","we'd": "we would",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have",
"we're": "we are","we've": "we have", "weren't": "were not","what'll": "what will",
"what'll've": "what will have","what're": "what are", "what've": "what have",
"when've": "when have","where'd": "where did", "where've": "where have",
"who'll": "who will","who'll've": "who will have","who've": "who have",
"why've": "why have","will've": "will have","won't": "will not",
"won't've": "will not have", "would've": "would have","wouldn't": "would not",
"wouldn't've": "would not have","y'all": "you all", "y'all'd": "you all would",
"y'all'd've": "you all would have","y'all're": "you all are",
"y'all've": "you all have", "you'd": "you would","you'd've": "you would have",
"you'll": "you will","you'll've": "you will have", "you're": "you are",
"you've": "you have"}
# Regular expression for finding contractions.
# Keys are sorted longest first so that "can't've" is tried before "can't",
# and escaped so that they are always matched literally.
contractions_re=re.compile('(%s)' % '|'.join(re.escape(key) for key in sorted(contractions_dict, key=len, reverse=True)))
# Function for expanding contractions
def expand_contractions(text,contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)
# Expanding contractions in the tweets
covid_tweets['cleaned']=covid_tweets['cleaned'].apply(expand_contractions)
for index,text in enumerate(covid_tweets['cleaned'][35:50]):
    print('Review %d:\n'%(index+1),text)
Review 1: florida governor ron desantis botches covid-19 response - by banning corona beer in order to flatten pandemic curve. Review 2: we apologise to the government of ekiti state for this error. we remain committed to improving our quality control processes to ensure accurate and transparent reporting of cases https://t.co/c6ypex9khe Review 3: nyt invented the video of a doctor fighting coronavirus in hospital. Review 4: in may we did not break 30k cases in a day. today the south alone reported 32830. https://t.co/fgcegi3o7v Review 5: alert: americans with coronavirus symptoms are being asked to cough directly onto president trump Review 6: we launched the #covid19 solidarity response fund which has so far mobilized $225+m from more than 563000 individuals companies & philanthropies. in addition we mobilized $1+ billion from member states & other generous to support countries-@drtedros https://t.co/xgpkpdvn0r Review 7: football player cristiano ronaldo turned all his hotels into hospitals to help coronavirus patients and is paying doctors and the staff. Review 8: including that there will again be testing of asymptomatic workers involved in managed isolation and quarantine and airport and border staff. this is part of our wider surveillance measures and is expected to be operational in early july. Review 9: i do not want to do a national lockdown again.' if #coronavirus continues to 'progress' in the uk the pm says the govt will have to take "further measures" but insists that he does not want a "second national lockdown". read more here: https://t.co/oc6er6h6lg https://t.co/nlcysrwbbn Review 10: our daily update is published. states reported 586k tests 28k cases and 224 deaths. https://t.co/bqjpy9w8qf Review 11: #indiafightscorona new recoveries in india have exceeded the new cases for 5 consecutive days. 
#covid19 https://t.co/iwv0eym3hd Review 12: the top 5 states with high active caseload are also the ones which are presently reporting a high level of recoveries. https://t.co/howhfx5wpe Review 13: there are currently 4927 people in managed isolation and quarantine. our current effective capacity is 7126. this gives us an excess capacity of 2199. over the next week we are projecting 3590 arrivals and 2699 departures from our facilities. Review 14: schools are struggling to cope with a lack of #covid19 tests - with new infections increasing since it became compulsory for pupils to return. but when should you get your child tested for the virus? here is our explainer 👇 Review 15: scientists at astrazeneca complain their work on a coronavirus vaccine keeps being delayed by noddy holder ringing up to ask if it will be ready by christmas https://t.co/2lmyztnapx
If you compare review 9 with its version in the lowercasing sub-section, you will notice that it started with 'i don't', which is now expanded to 'i do not'. This confirms that the contractions have been expanded to their full forms.
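Two details of dictionary-based expansion are easy to miss: a short key such as "can't" can shadow a longer one such as "can't've", and capitalized keys such as "I'm" never match text that has already been lowercased. The sketch below illustrates one way to handle both, assuming a small illustrative subset of the dictionary; the names `contractions`, `lookup`, `pattern`, and `expand` are hypothetical and not part of the notebook.

```python
import re

# Small illustrative subset; in practice the notebook's full dictionary would be used.
contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "I'm": "I am",
    "don't": "do not",
}

# Lowercase keys and values so that lookups work on lowercased text.
lookup = {key.lower(): value.lower() for key, value in contractions.items()}

# Longest keys first, so "can't've" is tried before its prefix "can't".
pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(lookup, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def expand(text):
    """Replace every known contraction in `text` with its expanded form."""
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)

print(expand("i'm sure they can't've known, so don't blame them"))
# i am sure they cannot have known, so do not blame them
```

Note that this sketch only handles the straight apostrophe ('); tweets that use a typographic apostrophe would need to be normalized first.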
The code for removing hyperlinks from raw text was taken from this link.
def remove_hyperlinks(text):
    return re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)
covid_tweets['cleaned']=covid_tweets['cleaned'].apply(remove_hyperlinks)
for index,text in enumerate(covid_tweets['cleaned'][35:50]):
    print('Review %d:\n'%(index+1),text)
Review 1: florida governor ron desantis botches covid-19 response - by banning corona beer in order to flatten pandemic curve. Review 2: we apologise to the government of ekiti state for this error. we remain committed to improving our quality control processes to ensure accurate and transparent reporting of cases Review 3: nyt invented the video of a doctor fighting coronavirus in hospital. Review 4: in may we did not break 30k cases in a day. today the south alone reported 32830. Review 5: alert: americans with coronavirus symptoms are being asked to cough directly onto president trump Review 6: we launched the #covid19 solidarity response fund which has so far mobilized $225+m from more than 563000 individuals companies & philanthropies. in addition we mobilized $1+ billion from member states & other generous to support countries-@drtedros Review 7: football player cristiano ronaldo turned all his hotels into hospitals to help coronavirus patients and is paying doctors and the staff. Review 8: including that there will again be testing of asymptomatic workers involved in managed isolation and quarantine and airport and border staff. this is part of our wider surveillance measures and is expected to be operational in early july. Review 9: i do not want to do a national lockdown again.' if #coronavirus continues to 'progress' in the uk the pm says the govt will have to take "further measures" but insists that he does not want a "second national lockdown". read more here: Review 10: our daily update is published. states reported 586k tests 28k cases and 224 deaths. Review 11: #indiafightscorona new recoveries in india have exceeded the new cases for 5 consecutive days. #covid19 Review 12: the top 5 states with high active caseload are also the ones which are presently reporting a high level of recoveries. Review 13: there are currently 4927 people in managed isolation and quarantine. our current effective capacity is 7126. this gives us an excess capacity of 2199. 
over the next week we are projecting 3590 arrivals and 2699 departures from our facilities. Review 14: schools are struggling to cope with a lack of #covid19 tests - with new infections increasing since it became compulsory for pupils to return. but when should you get your child tested for the virus? here is our explainer 👇 Review 15: scientists at astrazeneca complain their work on a coronavirus vaccine keeps being delayed by noddy holder ringing up to ask if it will be ready by christmas
All hyperlinks have been removed: the samples in the previous sub-sections contained hyperlinks, and they no longer appear.
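The pattern used above only matches links that start with a scheme such as https://. As a minimal sketch of a simpler alternative, assuming that links without a scheme (starting with "www.") should also be removed, the following uses a hypothetical `URL_RE` pattern and `strip_urls` helper and collapses the whitespace that removed links leave behind.

```python
import re

# Scheme-based URLs (http://, https://, ...) or bare "www." hosts,
# matched up to the next whitespace character.
URL_RE = re.compile(r"(?:\w+://|www\.)\S+")

def strip_urls(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    return " ".join(URL_RE.sub("", text).split())

sample = "read more here: https://t.co/oc6er6h6lg www.example.com/page?id=1 thanks"
print(strip_urls(sample))
# read more here: thanks
```

Because the match runs to the next whitespace, punctuation that directly follows a link (such as a trailing full stop) is removed along with it, which is harmless here since punctuation is stripped in a later step anyway.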
Using a regular expression, I can select all hashtags and mentions (tokens that start with # or @) and then remove them from the text.
def remove_hashtags(text):
    # '|' is not needed inside a character class; [@#] matches either symbol
    return re.sub(r"[@#][A-Za-z0-9_]+","", text)
covid_tweets['cleaned']=covid_tweets['cleaned'].apply(remove_hashtags)
for index,text in enumerate(covid_tweets['cleaned'][35:50]):
    print('Review %d:\n'%(index+1),text)
Review 1: florida governor ron desantis botches covid-19 response - by banning corona beer in order to flatten pandemic curve. Review 2: we apologise to the government of ekiti state for this error. we remain committed to improving our quality control processes to ensure accurate and transparent reporting of cases Review 3: nyt invented the video of a doctor fighting coronavirus in hospital. Review 4: in may we did not break 30k cases in a day. today the south alone reported 32830. Review 5: alert: americans with coronavirus symptoms are being asked to cough directly onto president trump Review 6: we launched the solidarity response fund which has so far mobilized $225+m from more than 563000 individuals companies & philanthropies. in addition we mobilized $1+ billion from member states & other generous to support countries- Review 7: football player cristiano ronaldo turned all his hotels into hospitals to help coronavirus patients and is paying doctors and the staff. Review 8: including that there will again be testing of asymptomatic workers involved in managed isolation and quarantine and airport and border staff. this is part of our wider surveillance measures and is expected to be operational in early july. Review 9: i do not want to do a national lockdown again.' if continues to 'progress' in the uk the pm says the govt will have to take "further measures" but insists that he does not want a "second national lockdown". read more here: Review 10: our daily update is published. states reported 586k tests 28k cases and 224 deaths. Review 11: new recoveries in india have exceeded the new cases for 5 consecutive days. Review 12: the top 5 states with high active caseload are also the ones which are presently reporting a high level of recoveries. Review 13: there are currently 4927 people in managed isolation and quarantine. our current effective capacity is 7126. this gives us an excess capacity of 2199. 
over the next week we are projecting 3590 arrivals and 2699 departures from our facilities. Review 14: schools are struggling to cope with a lack of tests - with new infections increasing since it became compulsory for pupils to return. but when should you get your child tested for the virus? here is our explainer 👇 Review 15: scientists at astrazeneca complain their work on a coronavirus vaccine keeps being delayed by noddy holder ringing up to ask if it will be ready by christmas
The hashtags were removed successfully. Compare review 6 above with its version in the previous sub-sections: #covid19 is gone. Since the same logic is applied to every tweet, no hashtags remain at this point.
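Removing a hashtag entirely can also remove a meaningful word, as in review 9, where "if #coronavirus continues" became "if continues". The sketch below is one possible variation, not the approach used in this notebook: a hypothetical `strip_tags` helper with a `keep_hashtag_words` flag that either drops hashtags completely or keeps the word and drops only the # symbol.

```python
import re

MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def strip_tags(text, keep_hashtag_words=False):
    """Drop @mentions; drop hashtags entirely or keep only their word."""
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub(r"\1" if keep_hashtag_words else "", text)
    return " ".join(text.split())  # collapse leftover whitespace

sample = "if #coronavirus continues to progress -@drtedros"
print(strip_tags(sample))
# if continues to progress -
print(strip_tags(sample, keep_hashtag_words=True))
# if coronavirus continues to progress -
```

Which variant is preferable depends on whether hashtag words are expected to carry signal for the classifier; that is a modelling decision this sketch does not settle.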
covid_tweets['cleaned']=covid_tweets['cleaned'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
for index,text in enumerate(covid_tweets['cleaned'][35:50]):
    print('Review %d:\n'%(index+1),text)
Review 1: florida governor ron desantis botches covid19 response by banning corona beer in order to flatten pandemic curve Review 2: we apologise to the government of ekiti state for this error we remain committed to improving our quality control processes to ensure accurate and transparent reporting of cases Review 3: nyt invented the video of a doctor fighting coronavirus in hospital Review 4: in may we did not break 30k cases in a day today the south alone reported 32830 Review 5: alert americans with coronavirus symptoms are being asked to cough directly onto president trump Review 6: we launched the solidarity response fund which has so far mobilized 225m from more than 563000 individuals companies amp philanthropies in addition we mobilized 1 billion from member states amp other generous to support countries Review 7: football player cristiano ronaldo turned all his hotels into hospitals to help coronavirus patients and is paying doctors and the staff Review 8: including that there will again be testing of asymptomatic workers involved in managed isolation and quarantine and airport and border staff this is part of our wider surveillance measures and is expected to be operational in early july Review 9: i do not want to do a national lockdown again if continues to progress in the uk the pm says the govt will have to take further measures but insists that he does not want a second national lockdown read more here Review 10: our daily update is published states reported 586k tests 28k cases and 224 deaths Review 11: new recoveries in india have exceeded the new cases for 5 consecutive days Review 12: the top 5 states with high active caseload are also the ones which are presently reporting a high level of recoveries Review 13: there are currently 4927 people in managed isolation and quarantine our current effective capacity is 7126 this gives us an excess capacity of 2199 over the next week we are projecting 3590 arrivals and 2699 departures from our facilities 
Review 14: schools are struggling to cope with a lack of tests with new infections increasing since it became compulsory for pupils to return but when should you get your child tested for the virus here is our explainer 👇 Review 15: scientists at astrazeneca complain their work on a coronavirus vaccine keeps being delayed by noddy holder ringing up to ask if it will be ready by christmas
Notice that no dots, commas, or other punctuation marks appear in the text anymore.
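In the sample output, review 6 now contains the token "amp" where an ampersand used to be. This suggests (an assumption, not verified against the raw data) that the tweets contain HTML entities such as `&amp;`, whose letters survive punctuation removal. A minimal sketch of one way to avoid this is to decode entities with the standard library's `html.unescape` before stripping punctuation; `PUNCT_TABLE` and `strip_punctuation` are hypothetical names.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Decode HTML entities first, then delete ASCII punctuation."""
    decoded = html.unescape(text)          # "&amp;" becomes "&"
    cleaned = decoded.translate(PUNCT_TABLE)
    return " ".join(cleaned.split())       # collapse leftover whitespace

print(strip_punctuation("companies &amp; philanthropies. in addition we mobilized $1+ billion"))
# companies philanthropies in addition we mobilized 1 billion
```

`string.punctuation` only covers ASCII characters, so typographic quotes and similar symbols would remain with either approach.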
The following link helped me apply a method that removes emojis from raw text.
def remove_emojis(text):
    # Character class covering emoji and related symbol ranges
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # broad range: also covers CJK, kana, and Hangul text
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", re.UNICODE)
    return emoj.sub('', text)
covid_tweets['cleaned']=covid_tweets['cleaned'].apply(remove_emojis)
for index,text in enumerate(covid_tweets['cleaned'][35:50]):
    print('Review %d:\n'%(index+1),text)
Review 1: florida governor ron desantis botches covid19 response by banning corona beer in order to flatten pandemic curve Review 2: we apologise to the government of ekiti state for this error we remain committed to improving our quality control processes to ensure accurate and transparent reporting of cases Review 3: nyt invented the video of a doctor fighting coronavirus in hospital Review 4: in may we did not break 30k cases in a day today the south alone reported 32830 Review 5: alert americans with coronavirus symptoms are being asked to cough directly onto president trump Review 6: we launched the solidarity response fund which has so far mobilized 225m from more than 563000 individuals companies amp philanthropies in addition we mobilized 1 billion from member states amp other generous to support countries Review 7: football player cristiano ronaldo turned all his hotels into hospitals to help coronavirus patients and is paying doctors and the staff Review 8: including that there will again be testing of asymptomatic workers involved in managed isolation and quarantine and airport and border staff this is part of our wider surveillance measures and is expected to be operational in early july Review 9: i do not want to do a national lockdown again if continues to progress in the uk the pm says the govt will have to take further measures but insists that he does not want a second national lockdown read more here Review 10: our daily update is published states reported 586k tests 28k cases and 224 deaths Review 11: new recoveries in india have exceeded the new cases for 5 consecutive days Review 12: the top 5 states with high active caseload are also the ones which are presently reporting a high level of recoveries Review 13: there are currently 4927 people in managed isolation and quarantine our current effective capacity is 7126 this gives us an excess capacity of 2199 over the next week we are projecting 3590 arrivals and 2699 departures from our facilities 
Review 14: schools are struggling to cope with a lack of tests with new infections increasing since it became compulsory for pupils to return but when should you get your child tested for the virus here is our explainer Review 15: scientists at astrazeneca complain their work on a coronavirus vaccine keeps being delayed by noddy holder ringing up to ask if it will be ready by christmas
Notice that review 14 used to end with the emoji 👇, which has now been removed. This indicates that the emojis have been stripped from the text.
This note is taken from the COVID-19 tweets EDA notebook.
Note: Keep in mind that the model to be created should give results according to the keywords of the tweets. Digits should not matter, because COVID-19 figures change over time and the model I am developing has no way to distinguish 'fake' from 'real' tweets based on digits.
def remove_digits(text):
    return re.sub(r"\d+", "", text)
covid_tweets['cleaned']=covid_tweets['cleaned'].apply(remove_digits)
for index,text in enumerate(covid_tweets['cleaned'][35:50]):
    print('Review %d:\n'%(index+1),text)
Review 1: florida governor ron desantis botches covid response by banning corona beer in order to flatten pandemic curve Review 2: we apologise to the government of ekiti state for this error we remain committed to improving our quality control processes to ensure accurate and transparent reporting of cases Review 3: nyt invented the video of a doctor fighting coronavirus in hospital Review 4: in may we did not break k cases in a day today the south alone reported Review 5: alert americans with coronavirus symptoms are being asked to cough directly onto president trump Review 6: we launched the solidarity response fund which has so far mobilized m from more than individuals companies amp philanthropies in addition we mobilized billion from member states amp other generous to support countries Review 7: football player cristiano ronaldo turned all his hotels into hospitals to help coronavirus patients and is paying doctors and the staff Review 8: including that there will again be testing of asymptomatic workers involved in managed isolation and quarantine and airport and border staff this is part of our wider surveillance measures and is expected to be operational in early july Review 9: i do not want to do a national lockdown again if continues to progress in the uk the pm says the govt will have to take further measures but insists that he does not want a second national lockdown read more here Review 10: our daily update is published states reported k tests k cases and deaths Review 11: new recoveries in india have exceeded the new cases for consecutive days Review 12: the top states with high active caseload are also the ones which are presently reporting a high level of recoveries Review 13: there are currently people in managed isolation and quarantine our current effective capacity is this gives us an excess capacity of over the next week we are projecting arrivals and departures from our facilities Review 14: schools are struggling to cope with a lack of 
tests with new infections increasing since it became compulsory for pupils to return but when should you get your child tested for the virus here is our explainer Review 15: scientists at astrazeneca complain their work on a coronavirus vaccine keeps being delayed by noddy holder ringing up to ask if it will be ready by christmas
Notice that the tweets no longer contain any digits. At this point the text is cleaned.
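The order of the cleaning steps matters: hyperlinks, hashtags, and mentions have to be removed while their punctuation (://, #, @) is still present. As a rough sketch, the steps of this section can be combined into a single function. The helper name `clean_tweet` and the simplified patterns are assumptions made for illustration; contraction expansion and emoji removal are left out for brevity.

```python
import re
import string

URL_RE = re.compile(r"\w+://\S+")
TAG_RE = re.compile(r"[@#]\w+")
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))
DIGIT_RE = re.compile(r"\d+")

def clean_tweet(text):
    """Apply the cleaning steps in the same order as this section."""
    text = text.lower()
    text = URL_RE.sub("", text)    # must run before punctuation removal
    text = TAG_RE.sub("", text)    # needs '@' and '#' to still be present
    text = PUNCT_RE.sub("", text)
    text = DIGIT_RE.sub("", text)
    return " ".join(text.split())  # collapse the gaps left by removed tokens

print(clean_tweet("Today the South alone reported 32830. #COVID19 https://t.co/fgcegi3o7v"))
# today the south alone reported
```

The final whitespace collapse is not part of the steps above; it is added here because each removal can leave double spaces behind.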
The previous sub-section cleaned the data, but it still contains words such as 'The, is, are' that does not add much meaning to the overall corpus and it appear often so we would need to remove them so that we can derive useful information from word cloud data visualization. The words appear in different tense forms as well (past, or present) and by performing lemmatization the word is reversed to its based form according while considering the surrounding context of the word.
covid_tweets['prepared'] = covid_tweets['cleaned'].apply(lambda x: ' '.join(token.lemma_ for token in nlp(x) if not token.is_stop))
covid_tweets
| | tweet | label | cleaned | prepared |
|---|---|---|---|---|
| 0 | The CDC currently reports 99031 deaths. In gen... | real | the cdc currently reports deaths in general t... | cdc currently report death general discrepan... |
| 1 | States reported 1121 deaths a small rise from ... | real | states reported deaths a small rise from last... | state report death small rise tuesday southe... |
| 2 | Politically Correct Woman (Almost) Uses Pandem... | fake | politically correct woman almost uses pandemic... | politically correct woman use pandemic excuse ... |
| 3 | #IndiaFightsCorona: We have 1524 #COVID testin... | real | we have testing laboratories in india and a... | test laboratory india th august test ... |
| 4 | Populous states can generate large case counts... | real | populous states can generate large case counts... | populous state generate large case count look ... |
| ... | ... | ... | ... | ... |
| 6415 | A tiger tested positive for COVID-19 please st... | fake | a tiger tested positive for covid please stay ... | tiger test positive covid stay away pet bird |
| 6416 | ???Autopsies prove that COVID-19 is??� a blood... | fake | autopsies prove that covid is a blood clot not... | autopsy prove covid blood clot pneumonia ought... |
| 6417 | _A post claims a COVID-19 vaccine has already ... | fake | a post claims a covid vaccine has already been... | post claim covid vaccine develop cause widespr... |
| 6418 | Aamir Khan Donate 250 Cr. In PM Relief Cares Fund | fake | aamir khan donate cr in pm relief cares fund | aamir khan donate cr pm relief care fund |
| 6419 | It has been 93 days since the last case of COV... | real | it has been days since the last case of covid... | day case covid acquire locally unknown sourc... |
6420 rows × 4 columns
Even after cleaning, stop words and the variety of tense forms can affect the results of the model that is going to be developed. For example, a word like 'report' is treated differently from its past form 'reported'. The previous step therefore removed stop words, since they usually do not add meaning, reduced words that mean the same thing to a common base form, and saved the result in a column called 'prepared'.
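To make the two operations concrete, here is a minimal sketch in plain Python. It is not the spaCy pipeline used above: `STOP_WORDS` and `LEMMAS` are small, hand-written stand-ins (assumptions for illustration only) for the stop-word list and the lemmatizer that spaCy provides.

```python
# Toy illustration only: STOP_WORDS and LEMMAS are hypothetical, hand-written
# stand-ins for the stop-word list and lemmatizer that spaCy provides.
STOP_WORDS = {"the", "is", "are", "a", "in", "from", "of"}
LEMMAS = {
    "reported": "report",
    "reports": "report",
    "states": "state",
    "deaths": "death",
    "cases": "case",
}


def prepare(text):
    """Drop stop words, then map each remaining token to its base form."""
    tokens = text.split()
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(LEMMAS.get(tok, tok) for tok in kept)


example = "states reported deaths a small rise from the last week"
prepared = prepare(example)
print(prepared)
```

In this sketch 'reported' and 'reports' both end up as 'report', which is the effect the 'prepared' column relies on. A real lemmatizer also uses part-of-speech context, which a lookup table cannot do.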
Group all fake tweets together and real tweets together
covid_tweets_group=covid_tweets[['label','prepared']].groupby(by='label').agg(lambda x:' '.join(x))
covid_tweets_group.head()
| label | prepared |
|---|---|
| fake | politically correct woman use pandemic excuse ... |
| real | cdc currently report death general discrepan... |
Create function for generating word clouds
def generate_wordcloud(data, title):
    # Build the cloud from a word -> frequency mapping
    wc = WordCloud(width=400, height=330, max_words=150, colormap="Dark2").generate_from_frequencies(data)
    plt.figure(figsize=(10, 8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    # Wrap long titles at 60 characters
    plt.title('\n'.join(wrap(title, 60)), fontsize=13)
    plt.show()
Create a document-term matrix that counts how often each word occurs in the corpus for each reliability label (fake or real).
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(analyzer='word')
data=cv.fit_transform(covid_tweets_group['prepared'])
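# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())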
df_dtm.index=covid_tweets_group.index
df_dtm=df_dtm.transpose()
df_dtm.sample(3)
| label | fake | real |
|---|---|---|
| ethnic | 2 | 7 |
| unstoppable | 1 | 0 |
| conclude | 1 | 2 |
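What the matrix holds can be shown in miniature with the standard library. The sketch below is not the CountVectorizer implementation; it only reproduces the counting idea on two made-up strings (the `grouped` values are toy data, not taken from the dataset).

```python
from collections import Counter

# Toy stand-ins for the two grouped strings (one per label); not real data.
grouped = {
    "fake": "claim cure covid claim lockdown",
    "real": "report new case covid report",
}

# One Counter per label plays the role of one column of the matrix.
counts = {label: Counter(text.split()) for label, text in grouped.items()}

# The vocabulary is the union of all words, in alphabetical order.
vocabulary = sorted(set().union(*counts.values()))

# Rows are words and columns are labels, matching the transposed layout above.
matrix = {
    word: {label: counts[label][word] for label in grouped}
    for word in vocabulary
}

for word, row in matrix.items():
    print(f"{word:10s} fake={row['fake']} real={row['real']}")
```

A word that never occurs under a label gets a count of 0, which is why a row such as 'unstoppable' can show 1 for fake and 0 for real.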
Display the word cloud
for label in df_dtm.columns:
    generate_wordcloud(df_dtm[label].sort_values(ascending=False), label)
Below are the results from the EDA notebook (before pre-processing).
We can see that the word clouds no longer contain numbers and that 'http', which was prominent before, has disappeared. In addition, words such as 'report', 'new' and 'cases' are strong keywords for real tweets, while words such as 'claim', 'cure' and 'lockdown' are more prominent keywords for fake tweets than in the version shown in the EDA notebook.
covid_tweets
| | tweet | label | cleaned | prepared |
|---|---|---|---|---|
| 0 | The CDC currently reports 99031 deaths. In gen... | real | the cdc currently reports deaths in general t... | cdc currently report death general discrepan... |
| 1 | States reported 1121 deaths a small rise from ... | real | states reported deaths a small rise from last... | state report death small rise tuesday southe... |
| 2 | Politically Correct Woman (Almost) Uses Pandem... | fake | politically correct woman almost uses pandemic... | politically correct woman use pandemic excuse ... |
| 3 | #IndiaFightsCorona: We have 1524 #COVID testin... | real | we have testing laboratories in india and a... | test laboratory india th august test ... |
| 4 | Populous states can generate large case counts... | real | populous states can generate large case counts... | populous state generate large case count look ... |
| ... | ... | ... | ... | ... |
| 6415 | A tiger tested positive for COVID-19 please st... | fake | a tiger tested positive for covid please stay ... | tiger test positive covid stay away pet bird |
| 6416 | ???Autopsies prove that COVID-19 is??� a blood... | fake | autopsies prove that covid is a blood clot not... | autopsy prove covid blood clot pneumonia ought... |
| 6417 | _A post claims a COVID-19 vaccine has already ... | fake | a post claims a covid vaccine has already been... | post claim covid vaccine develop cause widespr... |
| 6418 | Aamir Khan Donate 250 Cr. In PM Relief Cares Fund | fake | aamir khan donate cr in pm relief cares fund | aamir khan donate cr pm relief care fund |
| 6419 | It has been 93 days since the last case of COV... | real | it has been days since the last case of covid... | day case covid acquire locally unknown sourc... |
6420 rows × 4 columns
We will use the prepared version of the text, since the model learns better from good-quality data.
covid_tweets_tobe_merged = covid_tweets[['prepared','label']]
covid_tweets_tobe_merged
| | prepared | label |
|---|---|---|
| 0 | cdc currently report death general discrepan... | real |
| 1 | state report death small rise tuesday southe... | real |
| 2 | politically correct woman use pandemic excuse ... | fake |
| 3 | test laboratory india th august test ... | real |
| 4 | populous state generate large case count look ... | real |
| ... | ... | ... |
| 6415 | tiger test positive covid stay away pet bird | fake |
| 6416 | autopsy prove covid blood clot pneumonia ought... | fake |
| 6417 | post claim covid vaccine develop cause widespr... | fake |
| 6418 | aamir khan donate cr pm relief care fund | fake |
| 6419 | day case covid acquire locally unknown sourc... | real |
6420 rows × 2 columns
The following code adds a 'type' column with the value 'covid_tweet' to every row, since we will need this column later in the preparation for prediction part. The subset is copied first so that the assignment acts on an independent DataFrame and does not trigger pandas' SettingWithCopyWarning.
covid_tweets_tobe_merged = covid_tweets_tobe_merged.copy()
covid_tweets_tobe_merged['type'] = 'covid_tweet'
covid_tweets_tobe_merged
| | prepared | label | type |
|---|---|---|---|
| 0 | cdc currently report death general discrepan... | real | covid_tweet |
| 1 | state report death small rise tuesday southe... | real | covid_tweet |
| 2 | politically correct woman use pandemic excuse ... | fake | covid_tweet |
| 3 | test laboratory india th august test ... | real | covid_tweet |
| 4 | populous state generate large case count look ... | real | covid_tweet |
| ... | ... | ... | ... |
| 6415 | tiger test positive covid stay away pet bird | fake | covid_tweet |
| 6416 | autopsy prove covid blood clot pneumonia ought... | fake | covid_tweet |
| 6417 | post claim covid vaccine develop cause widespr... | fake | covid_tweet |
| 6418 | aamir khan donate cr pm relief care fund | fake | covid_tweet |
| 6419 | day case covid acquire locally unknown sourc... | real | covid_tweet |
6420 rows × 3 columns
covid_tweets_tobe_merged.shape
(6420, 3)
We have created a subset of the covid-19 tweets that is ready to be merged with the online articles dataset via the uniform merging technique.
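The merge itself is not shown at this point in the notebook. One plausible reading is a row-wise concatenation of two frames that share the same columns, sketched below on toy data. The `'online_article'` type value and the lowercasing of the article labels are assumptions made for illustration; the tables in this notebook do show 'real'/'fake' for tweets and 'REAL'/'FAKE' for articles, so some harmonisation of that kind would be needed.

```python
import pandas as pd

# Toy frames; the 'online_article' type value and the lowercasing of labels
# are assumptions for illustration, not steps confirmed by the notebook.
tweets = pd.DataFrame({
    "prepared": ["tiger test positive covid", "state report death"],
    "label": ["fake", "real"],
    "type": "covid_tweet",
})
articles = pd.DataFrame({
    "prepared": ["secretary state say monday"],
    "label": ["REAL"],
    "type": "online_article",
})

# Harmonise label casing so 'REAL' and 'real' are the same class.
articles["label"] = articles["label"].str.lower()

# Stack the rows; ignore_index gives the result a fresh 0..n-1 index.
merged = pd.concat([tweets, articles], ignore_index=True)
print(merged)
```

Keeping the 'type' column in both frames means each row can still be traced back to its source dataset after the merge.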
In this section of the notebook, I will process the online articles dataset using the following procedure:
At the end of this section, the online articles dataset should be ready to be merged with the covid-19 tweets, which have already been prepared.
We found in the online articles EDA notebook that 'Unnamed: 0' is a unique identifier for each article, which does not help us predict whether or not an article is reliable. Therefore I will drop that column from the dataset. In addition, since we use the text to determine reliability, the title column is not used for now and will also be dropped. In a future version we could use it for topic modelling, for instance.
online_articles.drop(['Unnamed: 0','title'],axis=1,inplace=True)
online_articles
| | text | label |
|---|---|---|
| 0 | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | It's primary day in New York and front-runners... | REAL |
| ... | ... | ... |
| 6330 | The State Department told the Republican Natio... | REAL |
| 6331 | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | FAKE |
| 6332 | Anti-Trump Protesters Are Tools of the Oligar... | FAKE |
| 6333 | ADDIS ABABA, Ethiopia —President Obama convene... | REAL |
| 6334 | Jeb Bush Is Suddenly Attacking Trump. Here's W... | REAL |
6335 rows × 2 columns
online_articles.columns.tolist()
['text', 'label']
The column names do not contain white space, which will allow us to merge the datasets easily later on.
online_articles.isnull().sum()
text 0 label 0 dtype: int64
The dataset has no missing values, so we do not have to handle missing data in the online articles dataset.
online_articles[online_articles.duplicated(keep=False)]
| | text | label |
|---|---|---|
| 12 | Click Here To Learn More About Alexandra's Per... | FAKE |
| 14 | Killing Obama administration rules, dismantlin... | REAL |
| 25 | Washington (CNN) The faction of the GOP that i... | REAL |
| 30 | On this day in 1973, J. Fred Buzhardt, a lawye... | REAL |
| 35 | Trump Raises Concern Over Members Of Urban Com... | FAKE |
| ... | ... | ... |
| 6227 | Email \nISIS barbarians used an industrial dou... | FAKE |
| 6233 | Email \nNorth Korea’s Foreign Ministry slammed... | FAKE |
| 6250 | A verdict in 2017 could have sweeping conseque... | REAL |
| 6270 | Killing Obama administration rules, dismantlin... | REAL |
| 6328 | | FAKE |
344 rows × 2 columns
online_articles[online_articles.duplicated(keep=False)]['text'].iloc[0]
"Click Here To Learn More About Alexandra's Personalized Essences Psychic Protection Click Here for More Information on Psychic Protection! Implant Removal Series Click here to listen to the IRP and SA/DNA Process Read The Testimonials Click Here To Read What Others Are Experiencing! Copyright © 2012 by Galactic Connection. All Rights Reserved. \nExcerpts may be used, provided that full and clear credit is given to Alexandra Meadors and www.galacticconnection.com with appropriate and specific direction to the original content. Unauthorized use and/or duplication of any material on this website without express and written permission from its author and owner is strictly prohibited. Thank you. \nPrivacy Policy \nBy subscribing to GalacticConnection.com you acknowledge that your name and e-mail address will be added to our database. As with all other personal information, only working affiliates of GalacticConnection.com have access to this data. We do not give GalacticConnection.com addresses to outside companies, nor will we ever rent or sell your email address. Any e-mail you send to GalacticConnection.com is completely confidential. Therefore, we will not add your name to our e-mail list without your permission. Continue reading... Galactic Connection 2016 | Design & Development by AA at Superluminal Systems Sign Up forOur Newsletter \nJoin our newsletter to receive exclusive updates, interviews, discounts, and more. Join Us!"
The dataset does contain duplicated rows. Some of these articles have empty text, and others lean towards spam or advertisement, so it makes sense that they are duplicated. Dividing the number of duplicated articles by the total number of articles shows that they make up about 5.4% (344 / 6335) of the whole dataset. Taking these factors together, I decided to drop these articles from the dataset.
online_articles.drop(online_articles[online_articles.duplicated(keep=False)].index, inplace=True)
online_articles[online_articles.duplicated(keep=False)]
| | text | label |
|---|---|---|
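To make the effect of `keep=False` concrete, here is a small sketch on a toy frame (the rows are invented for illustration). With `keep=False`, every member of a duplicate group is flagged, so dropping the flagged rows removes the first occurrence as well as the repeats.

```python
import pandas as pd

# Invented rows: one duplicated pair plus three unique rows.
toy = pd.DataFrame({
    "text": ["spam ad", "real story", "spam ad", "other story", ""],
    "label": ["FAKE", "REAL", "FAKE", "REAL", "FAKE"],
})

# keep=False flags every member of a duplicate group, not just the repeats.
dup_mask = toy.duplicated(keep=False)

# Fraction of rows that belong to a duplicate group.
share = dup_mask.mean()

# Boolean indexing with the inverted mask drops all flagged rows in one step.
deduplicated = toy[~dup_mask]

print(f"{dup_mask.sum()} of {len(toy)} rows flagged ({share:.0%})")
print(deduplicated)
```

If the goal were to keep one copy of each duplicated article instead, `keep='first'` (the default) would flag only the repeats. The notebook's choice removes all copies, which also discards the spam-like texts entirely.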
In this sub-section I will use the notes we derived from the online articles exploratory data analysis notebook to fix text issues.
In the conclusion of the online articles EDA notebook, it was mentioned that the following actions would solve the problems noticed:
The following note is also taken from the EDA notebook:
After that we should have clean text, but to continue processing it we need to remove stop words, as they appear often in the corpus without carrying actual meaning, and apply lemmatization to bring words from their various tense forms into their base form (for example, 'went' --lemma--> 'go').
In NLP, models treat words like 'Goat' and 'goat' as different tokens even though they are the same word. To overcome this problem, we lowercase the text. Here, I am using Python's built-in lower() string method to convert the text to lowercase.
online_articles['cleaned']=online_articles['text'].apply(lambda x: x.lower())
for index, text in enumerate(online_articles['cleaned'][45:50]):
    print('Review %d:\n' % (index + 1), text)
Review 1: sponsors say that the shootings in garland texas confirm their view of islam as violenceprone but critics say the event was designed to be incendiary and to poison relations at a volatile time when pamela geller and her controversial organization the american freedom defense initiative announced it would hold a cartoon contest in garland texas their plan to satirize and lampoon the founder of islam was intended to have both a defiant and provocative freespeech edge sunday’s contest and its 10000 prize were prompted in part by the paris charlie hebdo massacre in january ms geller said in march as well as the riots in muslim countries sparked by the publication of satirical antimuhammad cartoons by a danish newspaper in 2005 and indeed as if on cue two gunmen with apparent ties to islamic militants overseas tried to storm the heavily secured event in a similar fashion before being shot dead by a local police officer sunday night the incident comes at a time when tensions between some segments of american society and muslims appear to be becoming more fraught – with protests against muslims in texas and antimuslim socialmedia attacks after the release of the film american sniper in that context geller is actions raise questions about speech seen by many as motivated to incite anger and hatred it is an issue geller has faced before two weeks ago she won a federal freespeech case against new york’s metropolitan transportation authority which had refused to put up one of her ads “killing jews is worship that draws us close to allah” – a quote the ad attributes to “hamas mtv” geller’s organization has often clashed with officials in other cities including philadelphia and washington over their incendiary ads some of which compare islam to nazism in 2012 another federal judge ruled that cities could not refuse to post her subway poster that read “in any war between the civilized man and the savage support the civilized man support israel defeat jihad” many 
supporters of geller and her organization view the violence on sunday as a vindication of their views of islam as an inherently violenceprone religion but for others her relentless campaign to push the boundaries of free speech with intentionally incendiary messages is only poisoning public discourse at a particularly volatile time “and coming as it did right when we the united states of america are really facing a time when we have to question what it is that holds us together i can see this potentially aggravating the alreadychallenging times for dealing with some of these questions about cultural difference diversity and what kind of society we want to be” says gordon coonfield director of graduate studies in communication at villanova university near philadelphia after analyzing some of the submissions to the american freedom defense initiative’s “muhammad art exhibit and cartoon contest” professor coonfield pointed out the similarities of some of the depictions of the prophet muhammad to posters for “der ewige jude” or “the eternal jew” a notorious nazi propaganda “documentary” in one of the cartoons the prophet is depicted as contorted and snarling and as a hooknosed man in a turban holding a bloody knife the caption reads “when it comes to religion i’ve got the edge” the face coonfield notes is nearly identical to the contorted face of “the eternal jew” poster “that strategy for creating a sense of ‘unity’ by lifting up this internal enemy is as old as human civilization and culture” he says “it’s ironic that the kind of thinking that hitler used and the nazis have become famous for using – propaganda to try to create this sense of a collective by creating a strong unquestionably evil other who is right here in our midst so it’s kind of ironic that she’s trying to link some of these things together when that is in fact her message” despite the fact that images depicting the prophet muhammad cut deeply to the heart of muslim identity muslim leaders in texas 
told their followers not to picket or protest the event on sunday “her words are not just free speech” says linda sarsour executive director of the arab american association of new york “they are inciteful they incite hate against our whole community i was very dismayed by the shooting in garland texas but at the same time pamela geller is not the victim in this situation that we’re in right now” “she intentionally put that event together in hopes that she’d get the response that she received” ms sarsour says “we prayed but not one muslim from the state of texas went out to protest her” she added “muslim leaders specifically told people do not go anywhere near her let her do whatever she does we don’t care and there was no protesting outside – unfortunately except for these two guys from arizona who were already on the radar of the fbi anyway” advocates have tried to counter geller’s free political expressions with ad campaigns of a different tone in 2012 a coalition called rabbis for human rights responded to her “support the civilized man” poster with an opposing message that read “in the choice between love and hate choose love help stop bigotry against our muslim neighbors” and last week the makers of the satirical film “the muslims are coming” launched a humorous series of subway and bus ads to counter geller is “the muslims are coming and they shall strike with hugs so fierce you’ll end up calling your grandmother and telling her that you love her” but in an era in which the islamic state the tsarnaev trial and the lingering aftermath of 911 still inflame fears about islam many worry that sunday’s violence will exacerbate the current tensions “free speech is about being open to listening to the ideas you hate the most that you disagree with the most and i feel this group in particular is hiding behind this free speech rhetoric” coonfield says “this can’t become the poster child for christianity versus islam or the west versus the middle east we have to 
maintain a space where groups that have very different ways of thinking and viewing the world can still come together to talk about it without resorting to this kind of craziness” Review 2: drug and substance abuse has ruined and taken the lives of many substance addiction or abuse happens to be a complicated and complex disease which gradually gnaws the addict of their physical Review 3: by amanda froelich this treelike skyscraper is capable of growing 24 acresworth of crops and will be powered entirely by renewable resources by 2050 the world’s population is estimated to reach Review 4: the world watched in shock on wednesday as french satirical publication charlie hebdo became the site of a grisly terror attack gunmen opened fire on a secondfloor editorial meeting killing 12 people in total among them were eight journalists and two police officers journalists felt their profession under fire and several newspapers are taking to their front pages to react editorial cartoons somber black covers and powerful photos from the attack are seen on pages around the world the independent covers their paper with a fictional cover of charlie hebdo libération in paris said we are all charlie the times of london is calls it attack on freedom Review 5: ying and yang the gold and silver setup posted on home » silver » silver news » ying and yang the gold and silver setup no this is not a post about some new chinese law firm instead it’s just an update on the gold and silver markets which while refusing to go further down aren’t making much progress to the upside either from craig hemke tfmetalsreport today’s message a few more slightly positive us economic datapoints and these are likely enough to make a december ff rate hike a fait accompli again though…and i can’t stress this enough… we have traced out a pattern that is remarkably similar to last october and november in the run up to the most recent ff rate hike and what happened beginning the very next day well by now you 
know the story the week of the october 2o15 fomc produced a high trade in the dec15 contract of 1183 as the fedlines were digested later that week it became clear that the fed was going to raise the ff rate at the december 2015 meeting come hell or high water and they did however take a close look at how gold traded in the days and weeks between the oct15 fomc and the december rate hike price fell from 1180 to 1050 in about five weeks but note that it bottomed well in advance of the actual “news” of the ff rate hike this 10 drop was fueled by a near panic level liquidation of the specs at the comex how bad was it from the cot survey of 102715 just one day before that fateful fomc and fedlines the large specs in gold were net long more than 157000 contracts while the commercials were net short nearly 166000 just five weeks later the net position of the large specs was down to only 10000 contracts with the commercial position reaching an alltime low of just 2911 contracts net short we even speculated at the time that there were some days intraweek where the gold commercials were actually and historically net long well now compare last autumn to our current situation just as back then a ff rate hike is a near certainty at the fomc in december however as you know the anticipatory move in gold began a few weeks ago with the beatdown and purposeful break of both the 50day and 100day moving averages in late september take a look at the current chart and compare it to the one posted above in 2015 we had the october fomc and then two stout down weeks before price turned we slogged through 56 weeks of consolidation and cot improvement before the blast higher began in 2016 we had the september fomc and then two stout down weeks price is attempting to bottom and turn while the cot improves but it doesn’t seem ready just yet to begin moving consistently higher in 2015 the turn in gold began once the actual rate hike took place the rate hike and forecast for 3 or 4 more in 2016 
led to dollar strength which led to chinese devaluations which led to emerging market crises which led to equity selloffs and the gold price was already 510 off its lows by late january before the real fun began with the usdjpy falling 10 in early february are we headed down that same path again it certainly appears so as the first major salvos of chinese yuan devaluation were fired last week httpwwwzerohedgecomnews20161020dearjanetchinadevaluesmostaugustyuantumbleslowestsept2010 and just as in 2015 the cot is certainly undergoing a makeover too from the survey of 92716 the large specs in gold were net long 292000 contracts while the commercials were net short 325000 as of last tuesday and just three weeks later the large specs were down to 180000 net long for a reduction of 38 and the commercials were net short 203000 to be sure these are still hefty positions but much more “bullish” than the levels seen through the past summer and now check the full longterm chart you can see again the similarities between now and last fall also be sure to note however that the trend has clearly changed and that price is pointed higher so while we must still deal with the consolidation for a while longer…the ying and yang mentioned in the title of this post…it is clear to me that the trend remains higher and that the nowexpected fomc ff rate hike will be simply another “selltherumor buythenews” type of event for gold and silver this current period of relative quiet should be used to prepare for the next leg up not some sort of new bear market where paper prices are sharply falling use your time wisely and continue to preparestack accordingly tf on sale at sd bullion… this week only… this entry was posted in gold news silver news and tagged craig hemke december rate hike gold update silver update tfmetals report bookmark the permalink post navigation
We can see that all the words in the articles have been lowercased.
Contractions are shortened forms of words, such as don’t for do not and how’ll for how will. They reduce the time needed to speak and write. We need to expand these contractions for a better analysis of the articles.
The following dictionary is taken from this link.
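A short sketch of why lowercasing matters for any count-based representation: without it, each casing of a word becomes its own vocabulary entry. The token list is made up for illustration.

```python
from collections import Counter

tokens = "Goat goat GOAT sheep".split()

# Without normalisation, each casing is counted as a separate word.
raw_counts = Counter(tokens)

# After lower(), the three spellings collapse into one vocabulary entry.
lower_counts = Counter(tok.lower() for tok in tokens)

print(raw_counts)
print(lower_counts)
```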
# Dictionary of English Contractions
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not",
"can't": "cannot","can't've": "cannot have",
"'cause": "because","could've": "could have","couldn't": "could not",
"couldn't've": "could not have", "didn't": "did not","doesn't": "does not",
"don't": "do not","hadn't": "had not","hadn't've": "had not have",
"hasn't": "has not","haven't": "have not","he'd": "he would",
"he'd've": "he would have","he'll": "he will", "he'll've": "he will have",
"how'd": "how did","how'd'y": "how do you","how'll": "how will",
"I'd": "I would", "I'd've": "I would have","I'll": "I will",
"I'll've": "I will have","I'm": "I am","I've": "I have", "isn't": "is not",
"it'd": "it would","it'd've": "it would have","it'll": "it will",
"it'll've": "it will have", "let's": "let us","ma'am": "madam",
"mayn't": "may not","might've": "might have","mightn't": "might not",
"mightn't've": "might not have","must've": "must have","mustn't": "must not",
"mustn't've": "must not have", "needn't": "need not",
"needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
"oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have",
"she'll": "she will", "she'll've": "she will have","should've": "should have",
"shouldn't": "should not", "shouldn't've": "should not have","so've": "so have",
"that'd": "that would","that'd've": "that would have", "there'd": "there would",
"there'd've": "there would have", "they'd": "they would",
"they'd've": "they would have","they'll": "they will",
"they'll've": "they will have", "they're": "they are","they've": "they have",
"to've": "to have","wasn't": "was not","we'd": "we would",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have",
"we're": "we are","we've": "we have", "weren't": "were not","what'll": "what will",
"what'll've": "what will have","what're": "what are", "what've": "what have",
"when've": "when have","where'd": "where did", "where've": "where have",
"who'll": "who will","who'll've": "who will have","who've": "who have",
"why've": "why have","will've": "will have","won't": "will not",
"won't've": "will not have", "would've": "would have","wouldn't": "would not",
"wouldn't've": "would not have","y'all": "you all", "y'all'd": "you all would",
"y'all'd've": "you all would have","y'all're": "you all are",
"y'all've": "you all have", "you'd": "you would","you'd've": "you would have",
"you'll": "you will","you'll've": "you will have", "you're": "you are",
"you've": "you have"}
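# Regular expression for finding contractions.
# Keys are escaped and sorted longest first so that, for example,
# "can't've" is matched before its prefix "can't".
contractions_re = re.compile('(%s)' % '|'.join(
    re.escape(key) for key in sorted(contractions_dict, key=len, reverse=True)))
# Function for expanding contractions
def expand_contractions(text, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)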
# Expanding Contractions in the reviews
online_articles['cleaned']=online_articles['cleaned'].apply(lambda x:expand_contractions(x))
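One thing to watch for: the article text uses typographic apostrophes (’), as in "don’t" and "it’s", while the dictionary keys use the ASCII apostrophe ('). A regex built from the ASCII keys will not match the typographic forms. The sketch below shows one possible remedy, normalising apostrophes before expansion. `small_dict`, `normalize_apostrophes` and `expand` are hypothetical names introduced for this illustration, and the mapping is only a tiny subset of the dictionary above.

```python
import re

# A tiny subset of the mapping, for illustration only.
small_dict = {
    "don't": "do not",
    "it's": "it is",
    "can't": "cannot",
    "can't've": "cannot have",
}

# Longest keys first so "can't've" is tried before "can't"; re.escape guards
# against keys that contain regex metacharacters.
pattern = re.compile("|".join(
    re.escape(key) for key in sorted(small_dict, key=len, reverse=True)))


def normalize_apostrophes(text):
    """Map typographic single quotes (U+2019, U+2018) to the ASCII apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")


def expand(text):
    """Replace every contraction found in small_dict with its expansion."""
    return pattern.sub(lambda m: small_dict[m.group(0)], text)


sample = "we don\u2019t care and it\u2019s ironic"
print(expand(sample))                         # typographic apostrophes: no match
print(expand(normalize_apostrophes(sample)))  # contractions expanded
```

Note that this normalisation also turns opening and closing single quotation marks into apostrophes, which is usually harmless when punctuation is stripped later anyway.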
for index, text in enumerate(online_articles['cleaned'][45:50]):
    print('Review %d:\n' % (index + 1), text)
Review 1: sponsors say that the shootings in garland texas confirm their view of islam as violenceprone but critics say the event was designed to be incendiary and to poison relations at a volatile time when pamela geller and her controversial organization the american freedom defense initiative announced it would hold a cartoon contest in garland texas their plan to satirize and lampoon the founder of islam was intended to have both a defiant and provocative freespeech edge sunday’s contest and its 10000 prize were prompted in part by the paris charlie hebdo massacre in january ms geller said in march as well as the riots in muslim countries sparked by the publication of satirical antimuhammad cartoons by a danish newspaper in 2005 and indeed as if on cue two gunmen with apparent ties to islamic militants overseas tried to storm the heavily secured event in a similar fashion before being shot dead by a local police officer sunday night the incident comes at a time when tensions between some segments of american society and muslims appear to be becoming more fraught – with protests against muslims in texas and antimuslim socialmedia attacks after the release of the film american sniper in that context geller is actions raise questions about speech seen by many as motivated to incite anger and hatred it is an issue geller has faced before two weeks ago she won a federal freespeech case against new york’s metropolitan transportation authority which had refused to put up one of her ads “killing jews is worship that draws us close to allah” – a quote the ad attributes to “hamas mtv” geller’s organization has often clashed with officials in other cities including philadelphia and washington over their incendiary ads some of which compare islam to nazism in 2012 another federal judge ruled that cities could not refuse to post her subway poster that read “in any war between the civilized man and the savage support the civilized man support israel defeat jihad” many 
supporters of geller and her organization view the violence on sunday as a vindication of their views of islam as an inherently violenceprone religion but for others her relentless campaign to push the boundaries of free speech with intentionally incendiary messages is only poisoning public discourse at a particularly volatile time “and coming as it did right when we the united states of america are really facing a time when we have to question what it is that holds us together i can see this potentially aggravating the alreadychallenging times for dealing with some of these questions about cultural difference diversity and what kind of society we want to be” says gordon coonfield director of graduate studies in communication at villanova university near philadelphia after analyzing some of the submissions to the american freedom defense initiative’s “muhammad art exhibit and cartoon contest” professor coonfield pointed out the similarities of some of the depictions of the prophet muhammad to posters for “der ewige jude” or “the eternal jew” a notorious nazi propaganda “documentary” in one of the cartoons the prophet is depicted as contorted and snarling and as a hooknosed man in a turban holding a bloody knife the caption reads “when it comes to religion i’ve got the edge” the face coonfield notes is nearly identical to the contorted face of “the eternal jew” poster “that strategy for creating a sense of ‘unity’ by lifting up this internal enemy is as old as human civilization and culture” he says “it’s ironic that the kind of thinking that hitler used and the nazis have become famous for using – propaganda to try to create this sense of a collective by creating a strong unquestionably evil other who is right here in our midst so it’s kind of ironic that she’s trying to link some of these things together when that is in fact her message” despite the fact that images depicting the prophet muhammad cut deeply to the heart of muslim identity muslim leaders in texas 
told their followers not to picket or protest the event on sunday “her words are not just free speech” says linda sarsour executive director of the arab american association of new york “they are inciteful they incite hate against our whole community i was very dismayed by the shooting in garland texas but at the same time pamela geller is not the victim in this situation that we’re in right now” “she intentionally put that event together in hopes that she’d get the response that she received” ms sarsour says “we prayed but not one muslim from the state of texas went out to protest her” she added “muslim leaders specifically told people do not go anywhere near her let her do whatever she does we don’t care and there was no protesting outside – unfortunately except for these two guys from arizona who were already on the radar of the fbi anyway” advocates have tried to counter geller’s free political expressions with ad campaigns of a different tone in 2012 a coalition called rabbis for human rights responded to her “support the civilized man” poster with an opposing message that read “in the choice between love and hate choose love help stop bigotry against our muslim neighbors” and last week the makers of the satirical film “the muslims are coming” launched a humorous series of subway and bus ads to counter geller is “the muslims are coming and they shall strike with hugs so fierce you’ll end up calling your grandmother and telling her that you love her” but in an era in which the islamic state the tsarnaev trial and the lingering aftermath of 911 still inflame fears about islam many worry that sunday’s violence will exacerbate the current tensions “free speech is about being open to listening to the ideas you hate the most that you disagree with the most and i feel this group in particular is hiding behind this free speech rhetoric” coonfield says “this can’t become the poster child for christianity versus islam or the west versus the middle east we have to 
maintain a space where groups that have very different ways of thinking and viewing the world can still come together to talk about it without resorting to this kind of craziness” Review 2: drug and substance abuse has ruined and taken the lives of many substance addiction or abuse happens to be a complicated and complex disease which gradually gnaws the addict of their physical Review 3: by amanda froelich this treelike skyscraper is capable of growing 24 acresworth of crops and will be powered entirely by renewable resources by 2050 the world’s population is estimated to reach Review 4: the world watched in shock on wednesday as french satirical publication charlie hebdo became the site of a grisly terror attack gunmen opened fire on a secondfloor editorial meeting killing 12 people in total among them were eight journalists and two police officers journalists felt their profession under fire and several newspapers are taking to their front pages to react editorial cartoons somber black covers and powerful photos from the attack are seen on pages around the world the independent covers their paper with a fictional cover of charlie hebdo libération in paris said we are all charlie the times of london is calls it attack on freedom Review 5: ying and yang the gold and silver setup posted on home » silver » silver news » ying and yang the gold and silver setup no this is not a post about some new chinese law firm instead it’s just an update on the gold and silver markets which while refusing to go further down aren’t making much progress to the upside either from craig hemke tfmetalsreport today’s message a few more slightly positive us economic datapoints and these are likely enough to make a december ff rate hike a fait accompli again though…and i can’t stress this enough… we have traced out a pattern that is remarkably similar to last october and november in the run up to the most recent ff rate hike and what happened beginning the very next day well by now you 
know the story the week of the october 2o15 fomc produced a high trade in the dec15 contract of 1183 as the fedlines were digested later that week it became clear that the fed was going to raise the ff rate at the december 2015 meeting come hell or high water and they did however take a close look at how gold traded in the days and weeks between the oct15 fomc and the december rate hike price fell from 1180 to 1050 in about five weeks but note that it bottomed well in advance of the actual “news” of the ff rate hike this 10 drop was fueled by a near panic level liquidation of the specs at the comex how bad was it from the cot survey of 102715 just one day before that fateful fomc and fedlines the large specs in gold were net long more than 157000 contracts while the commercials were net short nearly 166000 just five weeks later the net position of the large specs was down to only 10000 contracts with the commercial position reaching an alltime low of just 2911 contracts net short we even speculated at the time that there were some days intraweek where the gold commercials were actually and historically net long well now compare last autumn to our current situation just as back then a ff rate hike is a near certainty at the fomc in december however as you know the anticipatory move in gold began a few weeks ago with the beatdown and purposeful break of both the 50day and 100day moving averages in late september take a look at the current chart and compare it to the one posted above in 2015 we had the october fomc and then two stout down weeks before price turned we slogged through 56 weeks of consolidation and cot improvement before the blast higher began in 2016 we had the september fomc and then two stout down weeks price is attempting to bottom and turn while the cot improves but it doesn’t seem ready just yet to begin moving consistently higher in 2015 the turn in gold began once the actual rate hike took place the rate hike and forecast for 3 or 4 more in 2016 
led to dollar strength which led to chinese devaluations which led to emerging market crises which led to equity selloffs and the gold price was already 510 off its lows by late january before the real fun began with the usdjpy falling 10 in early february are we headed down that same path again it certainly appears so as the first major salvos of chinese yuan devaluation were fired last week httpwwwzerohedgecomnews20161020dearjanetchinadevaluesmostaugustyuantumbleslowestsept2010 and just as in 2015 the cot is certainly undergoing a makeover too from the survey of 92716 the large specs in gold were net long 292000 contracts while the commercials were net short 325000 as of last tuesday and just three weeks later the large specs were down to 180000 net long for a reduction of 38 and the commercials were net short 203000 to be sure these are still hefty positions but much more “bullish” than the levels seen through the past summer and now check the full longterm chart you can see again the similarities between now and last fall also be sure to note however that the trend has clearly changed and that price is pointed higher so while we must still deal with the consolidation for a while longer…the ying and yang mentioned in the title of this post…it is clear to me that the trend remains higher and that the nowexpected fomc ff rate hike will be simply another “selltherumor buythenews” type of event for gold and silver this current period of relative quiet should be used to prepare for the next leg up not some sort of new bear market where paper prices are sharply falling use your time wisely and continue to preparestack accordingly tf on sale at sd bullion… this week only… this entry was posted in gold news silver news and tagged craig hemke december rate hike gold update silver update tfmetals report bookmark the permalink post navigation
# Remove ASCII punctuation. string.punctuation does not include curly quotes, dashes or ellipses.
online_articles['cleaned'] = online_articles['cleaned'].apply(
    lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
for index, text in enumerate(online_articles['cleaned'][45:50]):
    print('Review %d:\n' % (index + 1), text)
Review 1: sponsors say that the shootings in garland texas confirm their view of islam as violenceprone but critics say the event was designed to be incendiary and to poison relations at a volatile time when pamela geller and her controversial organization the american freedom defense initiative announced it would hold a cartoon contest in garland texas their plan to satirize and lampoon the founder of islam was intended to have both a defiant and provocative freespeech edge sunday’s contest and its 10000 prize were prompted in part by the paris charlie hebdo massacre in january ms geller said in march as well as the riots in muslim countries sparked by the publication of satirical antimuhammad cartoons by a danish newspaper in 2005 and indeed as if on cue two gunmen with apparent ties to islamic militants overseas tried to storm the heavily secured event in a similar fashion before being shot dead by a local police officer sunday night the incident comes at a time when tensions between some segments of american society and muslims appear to be becoming more fraught – with protests against muslims in texas and antimuslim socialmedia attacks after the release of the film american sniper in that context geller is actions raise questions about speech seen by many as motivated to incite anger and hatred it is an issue geller has faced before two weeks ago she won a federal freespeech case against new york’s metropolitan transportation authority which had refused to put up one of her ads “killing jews is worship that draws us close to allah” – a quote the ad attributes to “hamas mtv” geller’s organization has often clashed with officials in other cities including philadelphia and washington over their incendiary ads some of which compare islam to nazism in 2012 another federal judge ruled that cities could not refuse to post her subway poster that read “in any war between the civilized man and the savage support the civilized man support israel defeat jihad” many 
supporters of geller and her organization view the violence on sunday as a vindication of their views of islam as an inherently violenceprone religion but for others her relentless campaign to push the boundaries of free speech with intentionally incendiary messages is only poisoning public discourse at a particularly volatile time “and coming as it did right when we the united states of america are really facing a time when we have to question what it is that holds us together i can see this potentially aggravating the alreadychallenging times for dealing with some of these questions about cultural difference diversity and what kind of society we want to be” says gordon coonfield director of graduate studies in communication at villanova university near philadelphia after analyzing some of the submissions to the american freedom defense initiative’s “muhammad art exhibit and cartoon contest” professor coonfield pointed out the similarities of some of the depictions of the prophet muhammad to posters for “der ewige jude” or “the eternal jew” a notorious nazi propaganda “documentary” in one of the cartoons the prophet is depicted as contorted and snarling and as a hooknosed man in a turban holding a bloody knife the caption reads “when it comes to religion i’ve got the edge” the face coonfield notes is nearly identical to the contorted face of “the eternal jew” poster “that strategy for creating a sense of ‘unity’ by lifting up this internal enemy is as old as human civilization and culture” he says “it’s ironic that the kind of thinking that hitler used and the nazis have become famous for using – propaganda to try to create this sense of a collective by creating a strong unquestionably evil other who is right here in our midst so it’s kind of ironic that she’s trying to link some of these things together when that is in fact her message” despite the fact that images depicting the prophet muhammad cut deeply to the heart of muslim identity muslim leaders in texas 
told their followers not to picket or protest the event on sunday “her words are not just free speech” says linda sarsour executive director of the arab american association of new york “they are inciteful they incite hate against our whole community i was very dismayed by the shooting in garland texas but at the same time pamela geller is not the victim in this situation that we’re in right now” “she intentionally put that event together in hopes that she’d get the response that she received” ms sarsour says “we prayed but not one muslim from the state of texas went out to protest her” she added “muslim leaders specifically told people do not go anywhere near her let her do whatever she does we don’t care and there was no protesting outside – unfortunately except for these two guys from arizona who were already on the radar of the fbi anyway” advocates have tried to counter geller’s free political expressions with ad campaigns of a different tone in 2012 a coalition called rabbis for human rights responded to her “support the civilized man” poster with an opposing message that read “in the choice between love and hate choose love help stop bigotry against our muslim neighbors” and last week the makers of the satirical film “the muslims are coming” launched a humorous series of subway and bus ads to counter geller is “the muslims are coming and they shall strike with hugs so fierce you’ll end up calling your grandmother and telling her that you love her” but in an era in which the islamic state the tsarnaev trial and the lingering aftermath of 911 still inflame fears about islam many worry that sunday’s violence will exacerbate the current tensions “free speech is about being open to listening to the ideas you hate the most that you disagree with the most and i feel this group in particular is hiding behind this free speech rhetoric” coonfield says “this can’t become the poster child for christianity versus islam or the west versus the middle east we have to 
maintain a space where groups that have very different ways of thinking and viewing the world can still come together to talk about it without resorting to this kind of craziness” Review 2: drug and substance abuse has ruined and taken the lives of many substance addiction or abuse happens to be a complicated and complex disease which gradually gnaws the addict of their physical Review 3: by amanda froelich this treelike skyscraper is capable of growing 24 acresworth of crops and will be powered entirely by renewable resources by 2050 the world’s population is estimated to reach Review 4: the world watched in shock on wednesday as french satirical publication charlie hebdo became the site of a grisly terror attack gunmen opened fire on a secondfloor editorial meeting killing 12 people in total among them were eight journalists and two police officers journalists felt their profession under fire and several newspapers are taking to their front pages to react editorial cartoons somber black covers and powerful photos from the attack are seen on pages around the world the independent covers their paper with a fictional cover of charlie hebdo libération in paris said we are all charlie the times of london is calls it attack on freedom Review 5: ying and yang the gold and silver setup posted on home » silver » silver news » ying and yang the gold and silver setup no this is not a post about some new chinese law firm instead it’s just an update on the gold and silver markets which while refusing to go further down aren’t making much progress to the upside either from craig hemke tfmetalsreport today’s message a few more slightly positive us economic datapoints and these are likely enough to make a december ff rate hike a fait accompli again though…and i can’t stress this enough… we have traced out a pattern that is remarkably similar to last october and november in the run up to the most recent ff rate hike and what happened beginning the very next day well by now you 
know the story the week of the october 2o15 fomc produced a high trade in the dec15 contract of 1183 as the fedlines were digested later that week it became clear that the fed was going to raise the ff rate at the december 2015 meeting come hell or high water and they did however take a close look at how gold traded in the days and weeks between the oct15 fomc and the december rate hike price fell from 1180 to 1050 in about five weeks but note that it bottomed well in advance of the actual “news” of the ff rate hike this 10 drop was fueled by a near panic level liquidation of the specs at the comex how bad was it from the cot survey of 102715 just one day before that fateful fomc and fedlines the large specs in gold were net long more than 157000 contracts while the commercials were net short nearly 166000 just five weeks later the net position of the large specs was down to only 10000 contracts with the commercial position reaching an alltime low of just 2911 contracts net short we even speculated at the time that there were some days intraweek where the gold commercials were actually and historically net long well now compare last autumn to our current situation just as back then a ff rate hike is a near certainty at the fomc in december however as you know the anticipatory move in gold began a few weeks ago with the beatdown and purposeful break of both the 50day and 100day moving averages in late september take a look at the current chart and compare it to the one posted above in 2015 we had the october fomc and then two stout down weeks before price turned we slogged through 56 weeks of consolidation and cot improvement before the blast higher began in 2016 we had the september fomc and then two stout down weeks price is attempting to bottom and turn while the cot improves but it doesn’t seem ready just yet to begin moving consistently higher in 2015 the turn in gold began once the actual rate hike took place the rate hike and forecast for 3 or 4 more in 2016 
led to dollar strength which led to chinese devaluations which led to emerging market crises which led to equity selloffs and the gold price was already 510 off its lows by late january before the real fun began with the usdjpy falling 10 in early february are we headed down that same path again it certainly appears so as the first major salvos of chinese yuan devaluation were fired last week httpwwwzerohedgecomnews20161020dearjanetchinadevaluesmostaugustyuantumbleslowestsept2010 and just as in 2015 the cot is certainly undergoing a makeover too from the survey of 92716 the large specs in gold were net long 292000 contracts while the commercials were net short 325000 as of last tuesday and just three weeks later the large specs were down to 180000 net long for a reduction of 38 and the commercials were net short 203000 to be sure these are still hefty positions but much more “bullish” than the levels seen through the past summer and now check the full longterm chart you can see again the similarities between now and last fall also be sure to note however that the trend has clearly changed and that price is pointed higher so while we must still deal with the consolidation for a while longer…the ying and yang mentioned in the title of this post…it is clear to me that the trend remains higher and that the nowexpected fomc ff rate hike will be simply another “selltherumor buythenews” type of event for gold and silver this current period of relative quiet should be used to prepare for the next leg up not some sort of new bear market where paper prices are sharply falling use your time wisely and continue to preparestack accordingly tf on sale at sd bullion… this week only… this entry was posted in gold news silver news and tagged craig hemke december rate hike gold update silver update tfmetals report bookmark the permalink post navigation
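The output above still contains curly quotes (’ “ ”), dashes (–), ellipses (…) and », because string.punctuation only lists ASCII punctuation. Below is a minimal sketch of one way to strip those as well, assuming the standard-library unicodedata module is acceptable for this project. The helper name remove_all_punctuation is hypothetical and is not part of the original notebook.

```python
import string
import unicodedata

# ASCII punctuation and symbols, exactly what the regex step above removes.
ASCII_PUNCT = set(string.punctuation)

def remove_all_punctuation(text):
    """Drop ASCII punctuation plus any character in a Unicode 'P*' category."""
    return ''.join(
        ch for ch in text
        if ch not in ASCII_PUNCT
        and not unicodedata.category(ch).startswith('P')
    )

sample = "sunday’s “contest” – now… home » silver"
print(remove_all_punctuation(sample))
```

If this behaviour is wanted, it could be applied the same way as the other steps, for example with online_articles['cleaned'].apply(remove_all_punctuation). Removing a dash or » leaves a double space behind, so a whitespace-normalizing step may still be needed afterwards.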
The function below, adapted from an external reference (link), removes emojis from raw text.
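def remove_emojis(text):
    # Character class covering the emoji blocks plus several broad symbol ranges.
    emoj = re.compile("["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        "\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, arrows
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # very broad range: also includes CJK ideographs and many other non-emoji characters
        "\U0001f926-\U0001f937"  # person gesture emoji
        "\U00010000-\U0010ffff"  # all supplementary planes
        "\u2640-\u2642"  # female, earth and male signs
        "\u2600-\u2B55"  # miscellaneous symbols, dingbats, arrows
        "\u200d"  # zero-width joiner
        "\u23cf"  # eject symbol
        "\u23e9"  # fast-forward symbol
        "\u231a"  # watch
        "\ufe0f"  # variation selector-16
        "\u3030"  # wavy dash
        "]+", re.UNICODE)
    return emoj.sub('', text)

online_articles['cleaned'] = online_articles['cleaned'].apply(remove_emojis)
for index, text in enumerate(online_articles['cleaned'][45:50]):
    print('Review %d:\n' % (index + 1), text)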
Review 1: sponsors say that the shootings in garland texas confirm their view of islam as violenceprone but critics say the event was designed to be incendiary and to poison relations at a volatile time when pamela geller and her controversial organization the american freedom defense initiative announced it would hold a cartoon contest in garland texas their plan to satirize and lampoon the founder of islam was intended to have both a defiant and provocative freespeech edge sunday’s contest and its 10000 prize were prompted in part by the paris charlie hebdo massacre in january ms geller said in march as well as the riots in muslim countries sparked by the publication of satirical antimuhammad cartoons by a danish newspaper in 2005 and indeed as if on cue two gunmen with apparent ties to islamic militants overseas tried to storm the heavily secured event in a similar fashion before being shot dead by a local police officer sunday night the incident comes at a time when tensions between some segments of american society and muslims appear to be becoming more fraught – with protests against muslims in texas and antimuslim socialmedia attacks after the release of the film american sniper in that context geller is actions raise questions about speech seen by many as motivated to incite anger and hatred it is an issue geller has faced before two weeks ago she won a federal freespeech case against new york’s metropolitan transportation authority which had refused to put up one of her ads “killing jews is worship that draws us close to allah” – a quote the ad attributes to “hamas mtv” geller’s organization has often clashed with officials in other cities including philadelphia and washington over their incendiary ads some of which compare islam to nazism in 2012 another federal judge ruled that cities could not refuse to post her subway poster that read “in any war between the civilized man and the savage support the civilized man support israel defeat jihad” many 
supporters of geller and her organization view the violence on sunday as a vindication of their views of islam as an inherently violenceprone religion but for others her relentless campaign to push the boundaries of free speech with intentionally incendiary messages is only poisoning public discourse at a particularly volatile time “and coming as it did right when we the united states of america are really facing a time when we have to question what it is that holds us together i can see this potentially aggravating the alreadychallenging times for dealing with some of these questions about cultural difference diversity and what kind of society we want to be” says gordon coonfield director of graduate studies in communication at villanova university near philadelphia after analyzing some of the submissions to the american freedom defense initiative’s “muhammad art exhibit and cartoon contest” professor coonfield pointed out the similarities of some of the depictions of the prophet muhammad to posters for “der ewige jude” or “the eternal jew” a notorious nazi propaganda “documentary” in one of the cartoons the prophet is depicted as contorted and snarling and as a hooknosed man in a turban holding a bloody knife the caption reads “when it comes to religion i’ve got the edge” the face coonfield notes is nearly identical to the contorted face of “the eternal jew” poster “that strategy for creating a sense of ‘unity’ by lifting up this internal enemy is as old as human civilization and culture” he says “it’s ironic that the kind of thinking that hitler used and the nazis have become famous for using – propaganda to try to create this sense of a collective by creating a strong unquestionably evil other who is right here in our midst so it’s kind of ironic that she’s trying to link some of these things together when that is in fact her message” despite the fact that images depicting the prophet muhammad cut deeply to the heart of muslim identity muslim leaders in texas 
told their followers not to picket or protest the event on sunday “her words are not just free speech” says linda sarsour executive director of the arab american association of new york “they are inciteful they incite hate against our whole community i was very dismayed by the shooting in garland texas but at the same time pamela geller is not the victim in this situation that we’re in right now” “she intentionally put that event together in hopes that she’d get the response that she received” ms sarsour says “we prayed but not one muslim from the state of texas went out to protest her” she added “muslim leaders specifically told people do not go anywhere near her let her do whatever she does we don’t care and there was no protesting outside – unfortunately except for these two guys from arizona who were already on the radar of the fbi anyway” advocates have tried to counter geller’s free political expressions with ad campaigns of a different tone in 2012 a coalition called rabbis for human rights responded to her “support the civilized man” poster with an opposing message that read “in the choice between love and hate choose love help stop bigotry against our muslim neighbors” and last week the makers of the satirical film “the muslims are coming” launched a humorous series of subway and bus ads to counter geller is “the muslims are coming and they shall strike with hugs so fierce you’ll end up calling your grandmother and telling her that you love her” but in an era in which the islamic state the tsarnaev trial and the lingering aftermath of 911 still inflame fears about islam many worry that sunday’s violence will exacerbate the current tensions “free speech is about being open to listening to the ideas you hate the most that you disagree with the most and i feel this group in particular is hiding behind this free speech rhetoric” coonfield says “this can’t become the poster child for christianity versus islam or the west versus the middle east we have to 
maintain a space where groups that have very different ways of thinking and viewing the world can still come together to talk about it without resorting to this kind of craziness” Review 2: drug and substance abuse has ruined and taken the lives of many substance addiction or abuse happens to be a complicated and complex disease which gradually gnaws the addict of their physical Review 3: by amanda froelich this treelike skyscraper is capable of growing 24 acresworth of crops and will be powered entirely by renewable resources by 2050 the world’s population is estimated to reach Review 4: the world watched in shock on wednesday as french satirical publication charlie hebdo became the site of a grisly terror attack gunmen opened fire on a secondfloor editorial meeting killing 12 people in total among them were eight journalists and two police officers journalists felt their profession under fire and several newspapers are taking to their front pages to react editorial cartoons somber black covers and powerful photos from the attack are seen on pages around the world the independent covers their paper with a fictional cover of charlie hebdo libération in paris said we are all charlie the times of london is calls it attack on freedom Review 5: ying and yang the gold and silver setup posted on home » silver » silver news » ying and yang the gold and silver setup no this is not a post about some new chinese law firm instead it’s just an update on the gold and silver markets which while refusing to go further down aren’t making much progress to the upside either from craig hemke tfmetalsreport today’s message a few more slightly positive us economic datapoints and these are likely enough to make a december ff rate hike a fait accompli again though…and i can’t stress this enough… we have traced out a pattern that is remarkably similar to last october and november in the run up to the most recent ff rate hike and what happened beginning the very next day well by now you 
know the story the week of the october 2o15 fomc produced a high trade in the dec15 contract of 1183 as the fedlines were digested later that week it became clear that the fed was going to raise the ff rate at the december 2015 meeting come hell or high water and they did however take a close look at how gold traded in the days and weeks between the oct15 fomc and the december rate hike price fell from 1180 to 1050 in about five weeks but note that it bottomed well in advance of the actual “news” of the ff rate hike this 10 drop was fueled by a near panic level liquidation of the specs at the comex how bad was it from the cot survey of 102715 just one day before that fateful fomc and fedlines the large specs in gold were net long more than 157000 contracts while the commercials were net short nearly 166000 just five weeks later the net position of the large specs was down to only 10000 contracts with the commercial position reaching an alltime low of just 2911 contracts net short we even speculated at the time that there were some days intraweek where the gold commercials were actually and historically net long well now compare last autumn to our current situation just as back then a ff rate hike is a near certainty at the fomc in december however as you know the anticipatory move in gold began a few weeks ago with the beatdown and purposeful break of both the 50day and 100day moving averages in late september take a look at the current chart and compare it to the one posted above in 2015 we had the october fomc and then two stout down weeks before price turned we slogged through 56 weeks of consolidation and cot improvement before the blast higher began in 2016 we had the september fomc and then two stout down weeks price is attempting to bottom and turn while the cot improves but it doesn’t seem ready just yet to begin moving consistently higher in 2015 the turn in gold began once the actual rate hike took place the rate hike and forecast for 3 or 4 more in 2016 
led to dollar strength which led to chinese devaluations which led to emerging market crises which led to equity selloffs and the gold price was already 510 off its lows by late january before the real fun began with the usdjpy falling 10 in early february are we headed down that same path again it certainly appears so as the first major salvos of chinese yuan devaluation were fired last week httpwwwzerohedgecomnews20161020dearjanetchinadevaluesmostaugustyuantumbleslowestsept2010 and just as in 2015 the cot is certainly undergoing a makeover too from the survey of 92716 the large specs in gold were net long 292000 contracts while the commercials were net short 325000 as of last tuesday and just three weeks later the large specs were down to 180000 net long for a reduction of 38 and the commercials were net short 203000 to be sure these are still hefty positions but much more “bullish” than the levels seen through the past summer and now check the full longterm chart you can see again the similarities between now and last fall also be sure to note however that the trend has clearly changed and that price is pointed higher so while we must still deal with the consolidation for a while longer…the ying and yang mentioned in the title of this post…it is clear to me that the trend remains higher and that the nowexpected fomc ff rate hike will be simply another “selltherumor buythenews” type of event for gold and silver this current period of relative quiet should be used to prepare for the next leg up not some sort of new bear market where paper prices are sharply falling use your time wisely and continue to preparestack accordingly tf on sale at sd bullion… this week only… this entry was posted in gold news silver news and tagged craig hemke december rate hike gold update silver update tfmetals report bookmark the permalink post navigation
This note comes from the online article EDA notebook, which indicated that digits should be removed.
Note: Keep in mind that the model to be created should classify articles according to their keywords, so digits should not matter. Distinguishing 'fake' from 'real' articles based on keywords is possible, but not with digits included (at least in my project).
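import re

def remove_digits(text):
    # Remove every run of digits (the raw string keeps \d from being read as a string escape).
    return re.sub(r"\d+", "", text)

online_articles['cleaned'] = online_articles['cleaned'].apply(remove_digits)

# Print five sample articles to confirm that the digits are gone.
for index, text in enumerate(online_articles['cleaned'][45:50]):
    print('Review %d:\n' % (index + 1), text)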
Review 1: sponsors say that the shootings in garland texas confirm their view of islam as violenceprone but critics say the event was designed to be incendiary and to poison relations at a volatile time when pamela geller and her controversial organization the american freedom defense initiative announced it would hold a cartoon contest in garland texas their plan to satirize and lampoon the founder of islam was intended to have both a defiant and provocative freespeech edge sunday’s contest and its prize were prompted in part by the paris charlie hebdo massacre in january ms geller said in march as well as the riots in muslim countries sparked by the publication of satirical antimuhammad cartoons by a danish newspaper in and indeed as if on cue two gunmen with apparent ties to islamic militants overseas tried to storm the heavily secured event in a similar fashion before being shot dead by a local police officer sunday night the incident comes at a time when tensions between some segments of american society and muslims appear to be becoming more fraught – with protests against muslims in texas and antimuslim socialmedia attacks after the release of the film american sniper in that context geller is actions raise questions about speech seen by many as motivated to incite anger and hatred it is an issue geller has faced before two weeks ago she won a federal freespeech case against new york’s metropolitan transportation authority which had refused to put up one of her ads “killing jews is worship that draws us close to allah” – a quote the ad attributes to “hamas mtv” geller’s organization has often clashed with officials in other cities including philadelphia and washington over their incendiary ads some of which compare islam to nazism in another federal judge ruled that cities could not refuse to post her subway poster that read “in any war between the civilized man and the savage support the civilized man support israel defeat jihad” many supporters of geller 
and her organization view the violence on sunday as a vindication of their views of islam as an inherently violenceprone religion but for others her relentless campaign to push the boundaries of free speech with intentionally incendiary messages is only poisoning public discourse at a particularly volatile time “and coming as it did right when we the united states of america are really facing a time when we have to question what it is that holds us together i can see this potentially aggravating the alreadychallenging times for dealing with some of these questions about cultural difference diversity and what kind of society we want to be” says gordon coonfield director of graduate studies in communication at villanova university near philadelphia after analyzing some of the submissions to the american freedom defense initiative’s “muhammad art exhibit and cartoon contest” professor coonfield pointed out the similarities of some of the depictions of the prophet muhammad to posters for “der ewige jude” or “the eternal jew” a notorious nazi propaganda “documentary” in one of the cartoons the prophet is depicted as contorted and snarling and as a hooknosed man in a turban holding a bloody knife the caption reads “when it comes to religion i’ve got the edge” the face coonfield notes is nearly identical to the contorted face of “the eternal jew” poster “that strategy for creating a sense of ‘unity’ by lifting up this internal enemy is as old as human civilization and culture” he says “it’s ironic that the kind of thinking that hitler used and the nazis have become famous for using – propaganda to try to create this sense of a collective by creating a strong unquestionably evil other who is right here in our midst so it’s kind of ironic that she’s trying to link some of these things together when that is in fact her message” despite the fact that images depicting the prophet muhammad cut deeply to the heart of muslim identity muslim leaders in texas told their followers 
not to picket or protest the event on sunday “her words are not just free speech” says linda sarsour executive director of the arab american association of new york “they are inciteful they incite hate against our whole community i was very dismayed by the shooting in garland texas but at the same time pamela geller is not the victim in this situation that we’re in right now” “she intentionally put that event together in hopes that she’d get the response that she received” ms sarsour says “we prayed but not one muslim from the state of texas went out to protest her” she added “muslim leaders specifically told people do not go anywhere near her let her do whatever she does we don’t care and there was no protesting outside – unfortunately except for these two guys from arizona who were already on the radar of the fbi anyway” advocates have tried to counter geller’s free political expressions with ad campaigns of a different tone in a coalition called rabbis for human rights responded to her “support the civilized man” poster with an opposing message that read “in the choice between love and hate choose love help stop bigotry against our muslim neighbors” and last week the makers of the satirical film “the muslims are coming” launched a humorous series of subway and bus ads to counter geller is “the muslims are coming and they shall strike with hugs so fierce you’ll end up calling your grandmother and telling her that you love her” but in an era in which the islamic state the tsarnaev trial and the lingering aftermath of still inflame fears about islam many worry that sunday’s violence will exacerbate the current tensions “free speech is about being open to listening to the ideas you hate the most that you disagree with the most and i feel this group in particular is hiding behind this free speech rhetoric” coonfield says “this can’t become the poster child for christianity versus islam or the west versus the middle east we have to maintain a space where groups that 
have very different ways of thinking and viewing the world can still come together to talk about it without resorting to this kind of craziness” Review 2: drug and substance abuse has ruined and taken the lives of many substance addiction or abuse happens to be a complicated and complex disease which gradually gnaws the addict of their physical Review 3: by amanda froelich this treelike skyscraper is capable of growing acresworth of crops and will be powered entirely by renewable resources by the world’s population is estimated to reach Review 4: the world watched in shock on wednesday as french satirical publication charlie hebdo became the site of a grisly terror attack gunmen opened fire on a secondfloor editorial meeting killing people in total among them were eight journalists and two police officers journalists felt their profession under fire and several newspapers are taking to their front pages to react editorial cartoons somber black covers and powerful photos from the attack are seen on pages around the world the independent covers their paper with a fictional cover of charlie hebdo libération in paris said we are all charlie the times of london is calls it attack on freedom Review 5: ying and yang the gold and silver setup posted on home » silver » silver news » ying and yang the gold and silver setup no this is not a post about some new chinese law firm instead it’s just an update on the gold and silver markets which while refusing to go further down aren’t making much progress to the upside either from craig hemke tfmetalsreport today’s message a few more slightly positive us economic datapoints and these are likely enough to make a december ff rate hike a fait accompli again though…and i can’t stress this enough… we have traced out a pattern that is remarkably similar to last october and november in the run up to the most recent ff rate hike and what happened beginning the very next day well by now you know the story the week of the october o fomc 
produced a high trade in the dec contract of as the fedlines were digested later that week it became clear that the fed was going to raise the ff rate at the december meeting come hell or high water and they did however take a close look at how gold traded in the days and weeks between the oct fomc and the december rate hike price fell from to in about five weeks but note that it bottomed well in advance of the actual “news” of the ff rate hike this drop was fueled by a near panic level liquidation of the specs at the comex how bad was it from the cot survey of just one day before that fateful fomc and fedlines the large specs in gold were net long more than contracts while the commercials were net short nearly just five weeks later the net position of the large specs was down to only contracts with the commercial position reaching an alltime low of just contracts net short we even speculated at the time that there were some days intraweek where the gold commercials were actually and historically net long well now compare last autumn to our current situation just as back then a ff rate hike is a near certainty at the fomc in december however as you know the anticipatory move in gold began a few weeks ago with the beatdown and purposeful break of both the day and day moving averages in late september take a look at the current chart and compare it to the one posted above in we had the october fomc and then two stout down weeks before price turned we slogged through weeks of consolidation and cot improvement before the blast higher began in we had the september fomc and then two stout down weeks price is attempting to bottom and turn while the cot improves but it doesn’t seem ready just yet to begin moving consistently higher in the turn in gold began once the actual rate hike took place the rate hike and forecast for or more in led to dollar strength which led to chinese devaluations which led to emerging market crises which led to equity selloffs and the gold price 
was already off its lows by late january before the real fun began with the usdjpy falling in early february are we headed down that same path again it certainly appears so as the first major salvos of chinese yuan devaluation were fired last week httpwwwzerohedgecomnewsdearjanetchinadevaluesmostaugustyuantumbleslowestsept and just as in the cot is certainly undergoing a makeover too from the survey of the large specs in gold were net long contracts while the commercials were net short as of last tuesday and just three weeks later the large specs were down to net long for a reduction of and the commercials were net short to be sure these are still hefty positions but much more “bullish” than the levels seen through the past summer and now check the full longterm chart you can see again the similarities between now and last fall also be sure to note however that the trend has clearly changed and that price is pointed higher so while we must still deal with the consolidation for a while longer…the ying and yang mentioned in the title of this post…it is clear to me that the trend remains higher and that the nowexpected fomc ff rate hike will be simply another “selltherumor buythenews” type of event for gold and silver this current period of relative quiet should be used to prepare for the next leg up not some sort of new bear market where paper prices are sharply falling use your time wisely and continue to preparestack accordingly tf on sale at sd bullion… this week only… this entry was posted in gold news silver news and tagged craig hemke december rate hike gold update silver update tfmetals report bookmark the permalink post navigation
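As the sample output shows, stripping digits has side effects: a token that mixed letters and digits loses only its digits ('50day' becomes 'day'), and a number that stood alone leaves two consecutive spaces behind. The following is a minimal sketch on a made-up sentence, not on the dataset. `collapse_spaces` is a hypothetical helper that is not part of this notebook; it is shown only as one possible follow-up step.

```python
import re

def remove_digits(text):
    # Same rule as in the notebook: delete every run of digits.
    return re.sub(r"\d+", "", text)

def collapse_spaces(text):
    # Hypothetical follow-up: squeeze runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

sample = "price fell from 1180 to 1050 in about five weeks below the 50day average"

no_digits = remove_digits(sample)
tidy = collapse_spaces(no_digits)

print(repr(no_digits))  # double spaces remain where the numbers used to be
print(repr(tidy))       # single spaces only
```

Whether the extra step is worth adding depends on the tokenizer used afterwards; spaCy keeps runs of extra whitespace as separate tokens, which may be why double spaces are still visible in the 'prepared' column later on.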
The dataset text has been cleaned using the same procedure used to pre-process the covid19 dataset, except that fewer actions were needed because article bodies rarely contain hyperlinks or hashtags.
The dataset text has been cleaned successfully and is ready for the stage where stop words are removed and the rest of the corpus is lemmatized.
The previous sub-section cleaned the data, but it still contains words such as 'the', 'is' and 'are' that add little meaning to the corpus and appear often, so we need to remove them in order to derive useful information from the word cloud visualization. Words also appear in different tenses (past or present); lemmatization reduces each word to its base form while taking the surrounding context of the word into account.
# Note: the following cell takes about 10-20 minutes to run.
online_articles['prepared'] = online_articles['cleaned'].apply(
    lambda x: ' '.join(token.lemma_ for token in nlp(x) if not token.is_stop)
)
online_articles
| | text | label | cleaned | prepared |
|---|---|---|---|---|
| 0 | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | daniel greenfield a shillman journalism fellow... | daniel greenfield shillman journalism fellow f... |
| 1 | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | google pinterest digg linkedin reddit stumbleu... | google pinter digg linkedin reddit stumbleupon... |
| 2 | U.S. Secretary of State John F. Kerry said Mon... | REAL | us secretary of state john f kerry said monday... | secretary state john f kerry say monday stop p... |
| 3 | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE | — kaydee king kaydeeking november the lesson... | — kaydee king kaydeeke november lesson toni... |
| 4 | It's primary day in New York and front-runners... | REAL | it is primary day in new york and frontrunners... | primary day new york frontrunner hillary clint... |
| ... | ... | ... | ... | ... |
| 6330 | The State Department told the Republican Natio... | REAL | the state department told the republican natio... | state department tell republican national comm... |
| 6331 | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | FAKE | the ‘p’ in pbs should stand for ‘plutocratic’ ... | ' p ' pbs stand ' plutocratic ' ' pentagon ' p... |
| 6332 | Anti-Trump Protesters Are Tools of the Oligar... | FAKE | antitrump protesters are tools of the oligarc... | antitrump protester tool oligarchy reform p... |
| 6333 | ADDIS ABABA, Ethiopia —President Obama convene... | REAL | addis ababa ethiopia —president obama convened... | addis ababa ethiopia — president obama convene... |
| 6334 | Jeb Bush Is Suddenly Attacking Trump. Here's W... | REAL | jeb bush is suddenly attacking trump here is w... | jeb bush suddenly attack trump matter \n\n jeb... |
5991 rows × 4 columns
After the text has been cleaned, stop words and the variety of tense forms can still affect the results of the model that is going to be developed. For example, a word like 'report' is treated differently from its past form 'reported'. Therefore, the previous step removed stop words, since they usually do not add meaning, and reduced words that mean the same thing to a single base form. The result is saved in a column called 'prepared'.
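Since the lemmatization cell is slow, one option that may reduce the running time is to stream the texts through spaCy's `nlp.pipe`, which processes them in batches instead of one call per article. This is a sketch under assumptions, not the method used in this notebook: the helper name `lemmatize_without_stop_words`, the batch size and the model name in the commented usage are all hypothetical.

```python
def lemmatize_without_stop_words(texts, nlp, batch_size=50):
    """Return one string per input text holding the lemmas of its non-stop-word tokens."""
    prepared = []
    # nlp.pipe yields one processed document per input text, in order.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        prepared.append(" ".join(token.lemma_ for token in doc if not token.is_stop))
    return prepared

# Hypothetical usage inside the notebook (requires spaCy and an English model):
# import spacy
# nlp = spacy.load("en_core_web_sm")  # assumed model; this section does not name the one used
# online_articles["prepared"] = lemmatize_without_stop_words(online_articles["cleaned"], nlp)
```

The filtering rule is the same as in the cell above (keep `token.lemma_` where `token.is_stop` is false); only the way the texts are fed to the pipeline changes.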
Group the words of all fake online articles together, and likewise the words of all real online articles.
online_articles_group=online_articles[['label','prepared']].groupby(by='label').agg(lambda x:' '.join(x))
online_articles_group.head()
| label | prepared |
|---|---|
| FAKE | daniel greenfield shillman journalism fellow f... |
| REAL | secretary state john f kerry say monday stop p... |
Create a function that generates word clouds.
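from textwrap import wrap

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def generate_wordcloud(data, title):
    # data: word -> frequency mapping (a pandas Series works); title: text shown above the plot.
    wc = WordCloud(width=400, height=330, max_words=150, colormap="Dark2").generate_from_frequencies(data)
    plt.figure(figsize=(10, 8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.title('\n'.join(wrap(title, 60)), fontsize=13)
    plt.show()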
Create a document-term matrix that counts how often each word occurs under each reliability label (FAKE or REAL).
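from sklearn.feature_extraction.text import CountVectorizer

# Count word occurrences per label: rows of `data` are labels, columns are vocabulary words.
cv = CountVectorizer(analyzer='word')
data = cv.fit_transform(online_articles_group['prepared'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
# (on scikit-learn older than 1.0, use get_feature_names() instead).
df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
df_dtm.index = online_articles_group.index
# Transpose so that each row is a word and each column is a label.
df_dtm = df_dtm.transpose()
df_dtm.sample(3)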
| label | FAKE | REAL |
|---|---|---|
| befuddle | 1 | 3 |
| grossly | 21 | 18 |
| classic | 64 | 51 |
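To make the content of this matrix concrete, the same kind of per-label word count can be sketched with the standard library alone. The two short strings below are made-up stand-ins for the grouped 'prepared' text, and plain whitespace splitting is an assumption that only approximates CountVectorizer's default tokenization (which, for example, drops single-character tokens).

```python
from collections import Counter

# Made-up stand-ins for the grouped 'prepared' text of each label.
grouped = {
    "FAKE": "clinton email leak clinton claim",
    "REAL": "clinton say campaign say vote",
}

# One word counter per label.
counts = {label: Counter(text.split()) for label, text in grouped.items()}

# Shared vocabulary across both labels, sorted for a stable row order.
vocabulary = sorted(set().union(*counts.values()))

# word -> {label: count}; a word missing from a label counts as 0.
dtm = {word: {label: counts[label][word] for label in grouped} for word in vocabulary}

for word in vocabulary:
    print(word, dtm[word])
```

Each row corresponds to one word and each column to one label, which is the layout `df_dtm` has after the transpose.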
Display the word cloud
for label in df_dtm.columns:
    # Most frequent words first; one word cloud per label (FAKE, REAL).
    generate_wordcloud(df_dtm[label].sort_values(ascending=False), label)
Below are the results obtained in the EDA notebook, for comparison.
We can see that there are now no numbers in the word clouds. In addition, words such as 'said' and 'go' are reduced to their base forms, 'say' and 'go', among the keywords for real online articles. The base form of the word 'republican' became significantly stronger among the important words for real online articles, while some words, such as 'clinton', remain dominant before and after processing for both fake and real online articles. Lastly, 'trump' became significantly less important as an indicator of a real article.
online_articles
| | text | label | cleaned | prepared |
|---|---|---|---|---|
| 0 | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | daniel greenfield a shillman journalism fellow... | daniel greenfield shillman journalism fellow f... |
| 1 | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | google pinterest digg linkedin reddit stumbleu... | google pinter digg linkedin reddit stumbleupon... |
| 2 | U.S. Secretary of State John F. Kerry said Mon... | REAL | us secretary of state john f kerry said monday... | secretary state john f kerry say monday stop p... |
| 3 | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE | — kaydee king kaydeeking november the lesson... | — kaydee king kaydeeke november lesson toni... |
| 4 | It's primary day in New York and front-runners... | REAL | it is primary day in new york and frontrunners... | primary day new york frontrunner hillary clint... |
| ... | ... | ... | ... | ... |
| 6330 | The State Department told the Republican Natio... | REAL | the state department told the republican natio... | state department tell republican national comm... |
| 6331 | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | FAKE | the ‘p’ in pbs should stand for ‘plutocratic’ ... | ' p ' pbs stand ' plutocratic ' ' pentagon ' p... |
| 6332 | Anti-Trump Protesters Are Tools of the Oligar... | FAKE | antitrump protesters are tools of the oligarc... | antitrump protester tool oligarchy reform p... |
| 6333 | ADDIS ABABA, Ethiopia —President Obama convene... | REAL | addis ababa ethiopia —president obama convened... | addis ababa ethiopia — president obama convene... |
| 6334 | Jeb Bush Is Suddenly Attacking Trump. Here's W... | REAL | jeb bush is suddenly attacking trump here is w... | jeb bush suddenly attack trump matter \n\n jeb... |
5991 rows × 4 columns
We will use the prepared version of the text, since the model should learn from good-quality data.
online_articles_tobe_merged = online_articles[['prepared','label']]
online_articles_tobe_merged
| | prepared | label |
|---|---|---|
| 0 | daniel greenfield shillman journalism fellow f... | FAKE |
| 1 | google pinter digg linkedin reddit stumbleupon... | FAKE |
| 2 | secretary state john f kerry say monday stop p... | REAL |
| 3 | — kaydee king kaydeeke november lesson toni... | FAKE |
| 4 | primary day new york frontrunner hillary clint... | REAL |
| ... | ... | ... |
| 6330 | state department tell republican national comm... | REAL |
| 6331 | ' p ' pbs stand ' plutocratic ' ' pentagon ' p... | FAKE |
| 6332 | antitrump protester tool oligarchy reform p... | FAKE |
| 6333 | addis ababa ethiopia — president obama convene... | REAL |
| 6334 | jeb bush suddenly attack trump matter \n\n jeb... | REAL |
5991 rows × 2 columns
The following code adds the value 'article' to every row, since we will need the type column later on, in the preparation for prediction part.
online_articles_tobe_merged['type'] = 'article'
<ipython-input-120-aad441df2c93>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy online_articles_tobe_merged['type'] = 'article'
To be consistent in the label column, I need to transform the values from upper case to lower case.
online_articles_tobe_merged.label = online_articles_tobe_merged.label.map({'REAL':'real','FAKE':'fake'})
C:\Users\mohammed\anaconda3\lib\site-packages\pandas\core\generic.py:5494: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self[name] = value
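The two SettingWithCopyWarning messages above appear because the subset was taken from `online_articles` by column selection and then modified, so pandas cannot tell whether the assignment is meant for the subset or for the original frame. One common way to avoid the warning is to take an explicit copy first. The sketch below uses a tiny made-up frame; the names `articles` and `subset` are hypothetical, and `.str.lower()` is swapped in for the `map` call because it gives the same result for the values 'REAL' and 'FAKE'.

```python
import pandas as pd

# Tiny made-up stand-in for online_articles.
articles = pd.DataFrame({
    "text": ["Some RAW text", "Other RAW text"],
    "prepared": ["raw text", "raw text"],
    "label": ["FAKE", "REAL"],
})

# .copy() makes the subset an independent DataFrame, so later assignments are unambiguous.
subset = articles[["prepared", "label"]].copy()
subset["type"] = "article"
subset["label"] = subset["label"].str.lower()

print(subset)
print(articles.columns.tolist())  # the original frame is left unchanged
```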
online_articles_tobe_merged
| | prepared | label | type |
|---|---|---|---|
| 0 | daniel greenfield shillman journalism fellow f... | fake | article |
| 1 | google pinter digg linkedin reddit stumbleupon... | fake | article |
| 2 | secretary state john f kerry say monday stop p... | real | article |
| 3 | — kaydee king kaydeeke november lesson toni... | fake | article |
| 4 | primary day new york frontrunner hillary clint... | real | article |
| ... | ... | ... | ... |
| 6330 | state department tell republican national comm... | real | article |
| 6331 | ' p ' pbs stand ' plutocratic ' ' pentagon ' p... | fake | article |
| 6332 | antitrump protester tool oligarchy reform p... | fake | article |
| 6333 | addis ababa ethiopia — president obama convene... | real | article |
| 6334 | jeb bush suddenly attack trump matter \n\n jeb... | real | article |
5991 rows × 3 columns
online_articles_tobe_merged.shape
(5991, 3)
We have created a subset of the online articles. We also transformed the values of the target variable from upper case to lower case to be consistent with the covid19 tweets dataset. The subset is ready to be merged with the covid19 tweets dataset via a union merge.
In this section I will review the datasets and will clarify with an image which technique I will be using for merging.
covid_tweets_tobe_merged.shape
(6420, 3)
online_articles_tobe_merged.shape
(5991, 3)
covid_tweets_tobe_merged.head()
| | prepared | label | type |
|---|---|---|---|
| 0 | cdc currently report death general discrepan... | real | covid_tweet |
| 1 | state report death small rise tuesday southe... | real | covid_tweet |
| 2 | politically correct woman use pandemic excuse ... | fake | covid_tweet |
| 3 | test laboratory india th august test ... | real | covid_tweet |
| 4 | populous state generate large case count look ... | real | covid_tweet |
online_articles_tobe_merged.head()
| | prepared | label | type |
|---|---|---|---|
| 0 | daniel greenfield shillman journalism fellow f... | fake | article |
| 1 | google pinter digg linkedin reddit stumbleupon... | fake | article |
| 2 | secretary state john f kerry say monday stop p... | real | article |
| 3 | — kaydee king kaydeeke november lesson toni... | fake | article |
| 4 | primary day new york frontrunner hillary clint... | real | article |
Both prepared datasets have the same data characteristics, and therefore the union technique is suitable for merging.
We can use the pd.concat function available in pandas to merge the datasets in a union fashion along the rows axis (axis=0, which is the default).
df = pd.concat([online_articles_tobe_merged, covid_tweets_tobe_merged])
df.shape
(12411, 3)
df.head()
| | prepared | label | type |
|---|---|---|---|
| 0 | daniel greenfield shillman journalism fellow f... | fake | article |
| 1 | google pinter digg linkedin reddit stumbleupon... | fake | article |
| 2 | secretary state john f kerry say monday stop p... | real | article |
| 3 | — kaydee king kaydeeke november lesson toni... | fake | article |
| 4 | primary day new york frontrunner hillary clint... | real | article |
df.tail()
| | prepared | label | type |
|---|---|---|---|
| 6415 | tiger test positive covid stay away pet bird | fake | covid_tweet |
| 6416 | autopsy prove covid blood clot pneumonia ought... | fake | covid_tweet |
| 6417 | post claim covid vaccine develop cause widespr... | fake | covid_tweet |
| 6418 | aamir khan donate cr pm relief care fund | fake | covid_tweet |
| 6419 | day case covid acquire locally unknown sourc... | real | covid_tweet |
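Notice that the index values are wrong: each dataset kept its own numbering, so the tail still ends at 6419 even though there are 12411 rows. Let's fix them.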
# drop=True discards the old index instead of keeping it as an 'index' column.
df.reset_index(drop=True, inplace=True)
df.head()
| | prepared | label | type |
|---|---|---|---|
| 0 | daniel greenfield shillman journalism fellow f... | fake | article |
| 1 | google pinter digg linkedin reddit stumbleupon... | fake | article |
| 2 | secretary state john f kerry say monday stop p... | real | article |
| 3 | — kaydee king kaydeeke november lesson toni... | fake | article |
| 4 | primary day new york frontrunner hillary clint... | real | article |
df.tail()
| | prepared | label | type |
|---|---|---|---|
| 12406 | tiger test positive covid stay away pet bird | fake | covid_tweet |
| 12407 | autopsy prove covid blood clot pneumonia ought... | fake | covid_tweet |
| 12408 | post claim covid vaccine develop cause widespr... | fake | covid_tweet |
| 12409 | aamir khan donate cr pm relief care fund | fake | covid_tweet |
| 12410 | day case covid acquire locally unknown sourc... | real | covid_tweet |
The merged dataset gives logical results: the covid tweets dataset had 6420 rows and the online articles dataset had 5991 rows, both with 3 features. The combined dataset should therefore have 6420 + 5991 = 12411 rows and the same features (the feature names match exactly and the values of the target variable are consistent). The index of the merged dataset was off and has been fixed successfully.
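The union and the index fix can also be done in a single call by passing `ignore_index=True` to `pd.concat`. The sketch below uses two small made-up frames (`articles` and `tweets` are hypothetical names) and checks the same properties as above: identical columns, row counts that add up, and a continuous index.

```python
import pandas as pd

# Small made-up stand-ins for the two prepared datasets.
articles = pd.DataFrame({
    "prepared": ["secretary state say", "primary day new york", "jeb bush attack"],
    "label": ["real", "real", "real"],
    "type": "article",
})
tweets = pd.DataFrame({
    "prepared": ["cdc report death", "tiger test positive"],
    "label": ["real", "fake"],
    "type": "covid_tweet",
})

# A row-wise union only makes sense if both frames share the same columns.
same_columns = list(articles.columns) == list(tweets.columns)

# ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own index.
merged = pd.concat([articles, tweets], ignore_index=True)

print(same_columns)
print(merged.shape)
print(merged.index.tolist())
```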
df.to_csv('fakenews_cleaned.csv')
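One detail to keep in mind when the file is read back: by default `to_csv` also writes the index as a first, unnamed column, which `read_csv` then loads as an extra column. Passing `index=False` avoids this. The sketch below uses a made-up two-row frame and in-memory buffers instead of a file on disk, so nothing here touches `fakenews_cleaned.csv`.

```python
import io

import pandas as pd

# Made-up two-row stand-in for the merged dataset.
df = pd.DataFrame({
    "prepared": ["tiger test positive", "state report death"],
    "label": ["fake", "real"],
    "type": ["covid_tweet", "covid_tweet"],
})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the three data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the file has already been written with the index, reading it with `pd.read_csv('fakenews_cleaned.csv', index_col=0)` is another way to get the original three columns back.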
To conclude this notebook: after the data had been explored and issues in the text had been identified, this notebook went through whitespace in column names, missing values and duplicated data in both the covid19 tweets and online articles datasets. In addition, the problems in the text data have been resolved, and the words most important for the reliability of covid19 tweets and online articles have been visualized and compared before and after pre-processing. After preparing the datasets and making sure that both have the same characteristics, a union merge was a suitable approach for combining them, and the result of the merge was logical. Finally, the combined, cleaned dataset has been saved to a CSV file for the preparation (modelling) phase.